
WORLD INTELLECTUAL PROPERTY ORGANIZATION
International Bureau

PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification 7 : 
F04B 1/14, 1712 



A1



(11) International Publication Number: 
(43) International Publication Date: 



WO 00/34652 

15 June 2000 (15.06.00) 



(21) International Application Number: PCT/US99/29379 

(22) International Filing Date: 9 December 1999 (09.12.99)



(30) Priority Data: 
60/111,457
9 December 1998 (09.12.98) US



(71)(72) Applicant and Inventor: THILLY, William, G. [US/US];
438 S. Border Road, Winchester, MA 01890 (US).

(74) Agents: HOGLE, Doreen, M. et al.; Hamilton, Brook, Smith &
Reynolds, P.C., Two Militia Drive, Lexington, MA 02421
(US).



(81) Designated States: AE, AL, AM, AT, AU, AZ, BA, BB, BG,
BR, BY, CA, CH, CN, CR, CU, CZ, DE, DK, DM, EE,
ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP,
KE, KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA,
MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU,
SD, SE, SG, SI, SK, SL, TJ, TM, TR, TT, TZ, UA, UG,
US, UZ, VN, YU, ZA, ZW, ARIPO patent (GH, GM, KE,
LS, MW, SD, SL, SZ, TZ, UG, ZW), Eurasian patent (AM,
AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT,
BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU,
MC, NL, PT, SE), OAPI patent (BF, BJ, CF, CG, CI, CM,
GA, GN, GW, ML, MR, NE, SN, TD, TG).



Published 

With international search report. 



(54) Title: METHODS OF IDENTIFYING POINT MUTATIONS IN A GENOME 



[Front-page drawing: schematic of a multistage cancer model showing normal tissue, the initiated cell, growth of the adenoma, and the GK-/-A-/- founder cell of the lethal carcinoma. Legend: mutation rates; α, cell division rate of adenoma; β, cell death rate of adenoma; cell division and death rate of normal tissue. All variables may be affected by exposure, xenometabolism and DNA repair.]



(57) Abstract 

A method for identifying inherited point mutations in a targeted region of the genome in a large population of individuals and determining which inherited point mutations are deleterious, harmful or beneficial. Deleterious mutations are identified directly by a method of recognition using the set of point mutations observed in a large population of juveniles. Harmful mutations are identified by comparison of the set of point mutations observed in a large set of juveniles and a large set of aged individuals of the same population. Beneficial mutations are similarly identified.




WO 00/34652 PCT/US99/29379



METHODS OF IDENTIFYING POINT MUTATIONS IN A GENOME 

RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No.
60/111,457, filed December 9, 1998, the entire teachings of which are incorporated
herein by reference.

GOVERNMENT SUPPORT 

The invention was supported, in whole or in part, by grants P30-ES02109,
P01-ES07168, and P42-ES04675 from the National Institute of
Environmental Health Sciences, U.S.A. The Government has certain rights in the
invention.

BACKGROUND OF THE INVENTION

In the last year, the National Institutes of Health (U.S.A.) has allocated $36 million in a 3-year program to find 100,000 human single-nucleotide polymorphisms (SNPs) (Masood, 1999). The SNP Consortium, a group of public and private institutions, has separately committed $45 million in an effort to identify 300,000 SNPs in two years (Wellcome, 1999). Methods currently used to identify SNPs and other point mutations include single-strand conformation polymorphism (SSCP) (Orita et al., 1989), restriction fragment length polymorphism (RFLP) (Arnheim et al., 1985), amplified-fragment length polymorphism (AFLP) (Yunis et al., 1991), micro- and mini-satellite variation (Koreth et al., 1996), allele-specific hybridization (Shuber et al., 1997), denaturing gradient gel electrophoresis (DGGE) (Guldberg and Guttler, 1993), DNA chips with detection by fluorescence or mass spectrometry (Chee et al., 1996; Cargill et al., 1999; Hacia et al., 1999; Griffin et al., 1999; Li et al., 1999) and direct sequencing of the genome.

The techniques listed above all require assaying each individual, or small pools of at most about a dozen individuals (Trulzsch et al., 1999). Since the cost per individual is not trivial, this makes mutation discovery in large populations very expensive. However, such studies are required for determination of low-frequency point mutations with useful statistical precision (Hagmann, 1999).

A need exists for a method of identifying low-frequency inherited point 
mutations in large populations. 

SUMMARY OF THE INVENTION 
The invention relates to a method for identifying inherited point mutations in a target region of a genome, comprising providing a pool of DNA fragments isolated from a population, and

a) amplifying said target region of each of said fragments in a high fidelity polymerase chain reaction (PCR) under conditions suitable to produce double stranded DNA products which contain a terminal high temperature isomelting domain that is labeled with a detectable label, and where the mutant fraction of each PCR-induced mutation is not greater than about 5 x 10'^;

b) melting and reannealing the product of a) under conditions suitable to form duplexed DNA, thereby producing a mixture of wild type homoduplexes and heteroduplexes which contain point mutations;

c) separating the heteroduplexes from the homoduplexes based upon the differential melting temperatures of said heteroduplexes and said homoduplexes and recovering the heteroduplexes, thereby producing a second pool of DNA that is enriched in target regions containing point mutations;

d) amplifying said second pool in a high fidelity PCR under conditions where only homoduplexed double stranded DNA is produced, thereby producing a mixture of homoduplexed DNA containing wild type target region and homoduplexed DNAs which contain target regions that include point mutations;

e) resolving the homoduplexed DNAs containing target regions which include point mutations based upon the differential melting temperatures of the DNAs, and recovering the resolved DNAs which contain a target region which includes point mutations; and

f) sequencing the target region of the recovered DNAs which contain a target region which includes point mutations.
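
The noise condition in step a) can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the claimed method: it assumes diploid individuals contributing equal amounts of DNA to the pool, and the function names are hypothetical. The PCR noise ceiling is deliberately left as a caller-supplied parameter.

```python
def pooled_mutant_fraction(carrier_alleles: int, individuals: int) -> float:
    """Fraction of target-region copies in the pool that carry one point mutation.

    Assumes each diploid individual contributes two copies of the target
    region and that all individuals contribute equally (an idealisation).
    """
    if individuals <= 0:
        raise ValueError("individuals must be positive")
    total_alleles = 2 * individuals
    if not 0 <= carrier_alleles <= total_alleles:
        raise ValueError("carrier_alleles must lie between 0 and 2 * individuals")
    return carrier_alleles / total_alleles


def exceeds_pcr_noise(mutant_fraction: float, pcr_noise_fraction: float) -> bool:
    """True when an inherited mutation stands above the PCR-induced background.

    pcr_noise_fraction is the largest mutant fraction produced by any single
    PCR-induced mutation; it must be supplied by the caller.
    """
    return mutant_fraction > pcr_noise_fraction
```

For example, a single heterozygous carrier in a pool of 10,000 individuals gives `pooled_mutant_fraction(1, 10000)`, i.e. one mutant copy in 20,000, which is why the fidelity of the PCR in steps a) and d) limits the lowest allele frequency that can be recovered.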

In particular embodiments, the population can comprise at least about 1000, or at least about 10,000 individuals. In a more particular embodiment the population can comprise between about 10,000 and 1,000,000 individuals. In additional embodiments, the population consists of members of the same demographic group, such as those of European ancestry, African ancestry, Asian ancestry or Indian ancestry. In a preferred embodiment, the pool of fragments is enriched in fragments containing the target region. The heteroduplexes can be separated from the homoduplexes in c), and the homoduplexed DNAs can be resolved in e), by constant denaturing gel capillary electrophoresis, constant denaturing gel electrophoresis, denaturing gradient gel electrophoresis or denaturing high performance liquid chromatography. The target region can be any desired region of the genome, such as a portion of a protein or RNA encoding gene. The target region can be from about 80 to about 3,000 base pairs (bp). In one embodiment, the target region is an isomelting domain. In other embodiments, the target region is about 80 to about 1000 bp or about 100 to about 500 bp.

The invention also relates to a method for identifying genes which carry a harmful allele. In one embodiment the method comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of a population of young individuals, determining the frequencies with which each point mutation occurs, and calculating the sum of the frequency of all point mutations identified for each gene or segment;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a population of aged individuals, determining the frequencies with which each point mutation occurs, and calculating the sum of the frequencies of all point mutations identified for each gene or segment;

c) comparing the sum frequency of point mutations which are found in a selected gene or portion thereof of the young population calculated in a) with the sum frequency of point mutations which are found in the same gene or portion thereof of the aged population calculated in b), wherein a significant decrease in the sum frequency of point mutations in the aged population indicates that said selected gene carries a harmful allele.
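
The comparison in steps a) to c) can be sketched as follows. The text does not say which statistical test defines a "significant decrease", so this illustration assumes a one-sided two-proportion z-test at the 5% level. It also assumes that each sampled allele carries at most one of the listed mutations, so that the summed frequency behaves as a proportion. All function and mutation names are hypothetical.

```python
import math


def sum_frequency(mutant_allele_counts: dict, alleles_sampled: int) -> float:
    """Sum of the frequencies of all point mutations found in one gene."""
    return sum(mutant_allele_counts.values()) / alleles_sampled


def decrease_z(p_young: float, n_young: int, p_aged: float, n_aged: int) -> float:
    """Two-proportion z statistic; positive when the aged frequency is lower."""
    pooled = (p_young * n_young + p_aged * n_aged) / (n_young + n_aged)
    variance = pooled * (1 - pooled) * (1 / n_young + 1 / n_aged)
    if variance == 0:
        return 0.0  # no variation at all, so no evidence of a decrease
    return (p_young - p_aged) / math.sqrt(variance)


def gene_carries_harmful_allele(young_counts, n_young, aged_counts, n_aged,
                                z_crit=1.645):
    """Flag a gene whose summed mutation frequency falls in the aged group.

    z_crit = 1.645 is the one-sided 5% critical value (an assumed choice).
    """
    p_young = sum_frequency(young_counts, n_young)
    p_aged = sum_frequency(aged_counts, n_aged)
    return decrease_z(p_young, n_young, p_aged, n_aged) > z_crit
```

With hypothetical counts of 200 mutant alleles among 20,000 young alleles and 100 among 20,000 aged alleles, the summed frequency halves from 1% to 0.5% and the gene is flagged; with equal counts in both groups it is not.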
In another embodiment, the method for identifying genes which carry a harmful allele comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of a population of young individuals, and determining the frequencies with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a population of aged individuals, and determining the frequency with which each point mutation occurs; and

c) comparing the frequency of each point mutation identified in a selected gene or portion thereof of the young population determined in a) with the frequency of the same point mutations identified in said selected gene of the aged population determined in b), wherein a significant decrease in the frequency of two or more point mutations in said selected gene of the aged population relative to said selected gene of the young population indicates that said selected gene carries a harmful allele.

In a particular embodiment, the method further comprises:

d) determining the frequency of said two or more point mutations which decrease in the aged population in said selected gene of one or more intermediate age-specific populations;

e) determining the age-specific decline of said two or more point mutations; and

f) comparing the age-specific decline determined in e) with the theoretical age-specific decline of harmful alleles which cause mortal diseases, X(h,t), and determining if the functions are significantly different, wherein a determination that the age-specific decline determined in e) is not significantly different from the theoretical age-specific decline of harmful alleles which cause one or more mortal diseases further indicates that said selected gene carries a harmful allele and has a high probability of being causal of said one or more mortal diseases.

In additional embodiments the invention further comprises:

g) determining the frequency of said two or more point mutations which decrease in the aged population in said selected gene of one or more proband populations; and

h) comparing the frequencies of said two or more point mutations in said selected gene or portion thereof in the young population with the frequencies of said two or more point mutations in said selected gene or portion thereof in the proband populations; wherein a significant increase in the frequencies of said one or more point mutations in the proband population relative to the young population indicates that said gene carries a harmful allele that plays a causal role in said disease;

or

g) determining the frequency of said two or more point mutations which decrease in the aged population in said selected gene of one or more proband populations consisting of individuals with early onset disease; and

h) comparing the frequencies of said two or more point mutations in said selected gene or portion thereof in the young population with the frequencies of said two or more point mutations in said selected gene or portion thereof in the proband populations; wherein a significant increase in the frequencies of said one or more point mutations in the proband population relative to the young population indicates that said gene carries a harmful allele which is a secondary risk factor which accelerates the appearance of disease.

The invention also relates to a method for identifying genes which carry a harmful allele or which are linked to a gene that carries a harmful allele. In one embodiment the method comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of a population of young individuals, and determining the frequency with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a population of aged individuals, and determining the frequency with which each point mutation occurs;

c) comparing the frequency of each point mutation identified in a selected gene or portion thereof of the young population determined in a) with the frequency of the same point mutations identified in said selected gene of the aged population determined in b), wherein a significant decrease in the frequency of a point mutation in said selected gene of the aged population relative to said selected gene of the young population indicates that said selected gene carries a harmful allele or is linked to a gene that carries a harmful allele.

In another embodiment, the method for identifying genes which carry a harmful allele or which are linked to a gene that carries a harmful allele comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of an early onset proband population, and determining the frequency with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a late onset proband population, and determining the frequency with which each point mutation occurs;

c) comparing the frequencies of point mutations which are found in a selected gene or portion thereof in the early onset proband population with the frequencies of the same point mutations in said selected gene or portion thereof of the late onset proband populations; wherein a significant increase in the frequencies of one or more point mutations in the early onset proband population relative to the late onset proband population indicates that said gene carries a harmful allele which is a secondary risk factor which accelerates the appearance of disease.
The invention also relates to a method of identifying genes which carry a harmful allele that is a secondary risk factor that accelerates the appearance of a disease. In one embodiment the method comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of an early onset proband population, determining the frequency with which each point mutation occurs, and calculating the sum of the frequency of all point mutations identified for each gene or segment;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a late onset proband population, determining the frequency with which each point mutation occurs, and calculating the sum of the frequency of all point mutations identified for each gene or segment;

c) comparing the sum frequency of point mutations which are found in a selected gene or portion thereof of the early onset proband population calculated in a) with the sum frequency of point mutations which are found in the same gene or portion thereof of the late onset proband population calculated in b), wherein a significant decrease in the sum frequency of point mutations in the late onset proband population indicates that said selected gene carries a harmful allele which is a secondary risk factor that accelerates the appearance of a disease.

The invention also relates to a method for identifying genes which carry an allele which increases longevity, comprising:

a) identifying the inherited point mutations which are found in the genes or portions thereof of a population of young individuals, determining the frequencies with which each point mutation occurs, and calculating the sum of the frequency of all point mutations identified for each gene or segment;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a population of aged individuals, determining the frequencies with which each point mutation occurs, and calculating the sum of the frequencies of all point mutations identified for each gene or segment;

c) comparing the sum frequency of point mutations which are found in a selected gene or portion thereof of the young population calculated in a) with the sum frequency of point mutations which are found in the same gene or portion thereof of the aged population calculated in b), wherein a significant increase in the sum frequency of point mutations in the aged population indicates that said selected gene carries an allele which increases longevity.
In another embodiment, the method for identifying genes which carry an allele which increases longevity comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of a population of young individuals, and determining the frequencies with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a population of aged individuals, and determining the frequency with which each point mutation occurs; and

c) comparing the frequency of each point mutation identified in a selected gene or portion thereof of the young population determined in a) with the frequency of the same point mutations identified in said selected gene of the aged population determined in b), wherein a significant increase in the frequency of two or more point mutations in said selected gene of the aged population relative to said selected gene of the young population indicates that said selected gene carries an allele which increases longevity.
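
The per-mutation rule in step c) (two or more point mutations rising significantly in the aged population) can be sketched as below. As an assumption not stated in the text, each mutation is tested on its own with a one-sided two-proportion z-test, and no correction for multiple testing is applied. Function and mutation names are hypothetical.

```python
import math


def increase_z(young_count: int, n_young: int, aged_count: int, n_aged: int) -> float:
    """Two-proportion z statistic; positive when the aged frequency is higher."""
    pooled = (young_count + aged_count) / (n_young + n_aged)
    variance = pooled * (1 - pooled) * (1 / n_young + 1 / n_aged)
    if variance == 0:
        return 0.0  # mutation absent (or fixed) in both groups
    return (aged_count / n_aged - young_count / n_young) / math.sqrt(variance)


def rising_mutations(young: dict, aged: dict, n_young: int, n_aged: int,
                     z_crit: float = 1.645) -> list:
    """Point mutations whose frequency is significantly higher in the aged group."""
    names = sorted(set(young) | set(aged))
    return [m for m in names
            if increase_z(young.get(m, 0), n_young, aged.get(m, 0), n_aged) > z_crit]


def gene_carries_longevity_allele(young, aged, n_young, n_aged,
                                  min_mutations: int = 2) -> bool:
    """Apply the 'two or more point mutations' rule of step c)."""
    return len(rising_mutations(young, aged, n_young, n_aged)) >= min_mutations
```

Swapping the roles of the two populations gives the corresponding check for a significant decrease, which is the criterion used in the harmful-allele embodiments.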
In a further embodiment, the method for identifying genes which carry an allele which increases longevity or which are linked to a gene that increases longevity comprises:

a) identifying the inherited point mutations which are found in the genes or portions thereof of a population of young individuals, and determining the frequency with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or portions thereof of a population of aged individuals, and determining the frequency with which each point mutation occurs;

c) comparing the frequency of each point mutation identified in a selected gene or portion thereof of the young population determined in a) with the frequency of the same point mutations identified in said selected gene of the aged population determined in b), wherein a significant increase in the frequency of a point mutation in said selected gene of the aged population relative to said selected gene of the young population indicates that said selected gene carries an allele which increases longevity or is linked to a gene that increases longevity.
The invention also relates to a method for identifying genes which affect the incidence of a disease, comprising:

a) identifying the inherited point mutations which are found in genes or portions thereof of a population of young individuals not afflicted with said disease, determining the frequencies with which each point mutation occurs, and summing the frequency of all point mutations identified in each gene or segment thereof;

b) identifying the inherited point mutations which are found in genes or portions thereof of a proband population having said disease, determining the frequencies with which each point mutation occurs, and summing the frequency of all point mutations identified in each gene or segment thereof;

c) comparing the sum frequency of point mutations in a selected gene or portion thereof in the young population with the sum frequency of point mutations in said selected gene or portion thereof in the proband population; wherein a significant increase in the sum frequency of point mutations in the proband population indicates that said gene plays a causal role in said disease.

The invention also relates to a method for identifying a gene which carries deleterious alleles. In one embodiment, the method comprises:

a) identifying the inherited point mutations occurring in any exon(s) and splice sites of said gene of a population of young individuals;

b) identifying the subset of point mutations in a) that are obligatory knockout point mutations, and determining the frequencies with which each obligatory knockout point mutation occurs; and

c) summing the frequency of all obligatory knockout point mutations identified in the gene; wherein a sum frequency of less than about 2% indicates that said gene carries a deleterious allele.

In another embodiment, the method comprises:

a) identifying the inherited point mutations occurring in any exon(s) and splice sites of said gene of a population of young individuals;

b) identifying the subset of point mutations in a) that are obligatory knockout point mutations, and determining the frequencies with which each obligatory knockout point mutation occurs;

c) identifying the subset of point mutations in a) that are presumptive knockout point mutations, and determining the frequencies with which each presumptive knockout point mutation occurs; and

d) summing the frequency of all of said obligatory knockout point mutations and presumptive knockout point mutations identified in the gene; wherein a sum frequency of less than about 2% indicates that said gene carries a deleterious allele.

In one embodiment, a sum of about 0.02% to about 2% indicates that said gene carries a recessive deleterious allele. In another embodiment, a sum of less than about 0.02% indicates that said gene carries a dominant deleterious allele. A sum of greater than 2% suggests that said gene carries no deleterious alleles.
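
The thresholds above amount to a three-way classification of the summed knockout frequency. A minimal sketch follows; the function name and the labels are hypothetical, and because the text gives the cut-offs only as "about" 0.02% and "about" 2%, the handling of values exactly at a boundary is an arbitrary choice made here.

```python
def classify_knockout_sum(sum_frequency: float) -> str:
    """Classify a gene from the summed frequency of its knockout point mutations.

    sum_frequency is a fraction (0.02 means 2%), not a percentage.
    """
    if not 0.0 <= sum_frequency <= 1.0:
        raise ValueError("sum_frequency must be a fraction between 0 and 1")
    if sum_frequency < 0.0002:      # less than about 0.02%
        return "dominant deleterious"
    if sum_frequency <= 0.02:       # about 0.02% to about 2%
        return "recessive deleterious"
    return "no deleterious alleles indicated"   # greater than about 2%


def knockout_sum(frequencies: dict) -> float:
    """Sum the per-mutation frequencies of the knockout mutations in one gene."""
    return sum(frequencies.values())
```

For instance, two hypothetical knockout mutations at frequencies 0.3% and 0.2% sum to 0.5%, which falls in the recessive range.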

The invention also relates to a method for isolating and identifying a target region of a genome which contains inherited point mutations.

The invention further relates to an isolated nucleic acid which is complementary to a strand of a gene or allele thereof identified by the methods described herein.

The invention also relates to arrays of isolated nucleic acids of the invention which are immobilized on a solid support.

The present invention further relates to isolated nucleic acids and an array of isolated nucleic acids described herein for use in therapy (including prophylaxis) or diagnosis, and to the use of such nucleic acids for the manufacture of a medicament for the treatment of a particular disease or condition as described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a schematic representation of a multistage cancer hypothesis for sporadic cancer (Herrero-Jimenez et al., 1998) where the loss of both alleles of the first gatekeeper gene 'GK' and the remaining active allele of the second gatekeeper gene 'A' are the rate-limiting events. An inherited mutation in gene 'GK' would be expected to lead to early onset familial cancer while an inherited inactivating mutation in one allele of gene 'A' creates a risk of death by sporadic cancer.

Fig. 2A is a graph that demonstrates the age-dependent pancreatic cancer mortality, OBS(h,t), for European-American males born in 1900-1909. Solid squares are actual data. The smooth line is the model suggested in Herrero-Jimenez et al., 1998.

Fig. 2B is a graph that demonstrates the expected age-dependent surviving fraction at risk, X(h,t), for pancreatic cancer in this general population.

Figs. 3A-3F are histograms which show the expected number of individuals ± 2 SD with a particular polymorphism given population sizes of 1000 (Figs. 3A-3C) and 10,000 (Figs. 3D-3F) for the hypothetical situations of monogenic (Figs. 3A and 3D), multigenic (n=5) (Figs. 3B and 3E) and polygenic (n=2) (Figs. 3C and 3F) risks in pancreatic cancer analyzed for populations of 10,000 in the preceding section. It is clear by inspection that a sample of 1000 would permit discrimination among newborn, proband and centenarian populations for a polymorphism occurring at the expected frequency at the 95% confidence level for the case of simple monogenic inheritance but not for multigenic or polygenic risk factors. The number of genes involved for each of these cases is denoted by n. These confidence limits apply to the case of one and only one point mutation compared among the three population groups.
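
The "expected number ± 2 SD" bars can be reproduced under a simple sampling model. This sketch assumes, as a labeled assumption, that carriers are drawn binomially from each population; the text does not state how the histograms were computed, and the carrier frequencies used below are hypothetical, not values from the figures.

```python
import math


def carriers_with_2sd(population_size: int, carrier_frequency: float):
    """Return (expected carriers, lower bound, upper bound) for mean +/- 2 SD.

    Assumes binomial sampling: SD = sqrt(N * p * (1 - p)).
    The lower bound is clipped at zero because counts cannot be negative.
    """
    mean = population_size * carrier_frequency
    sd = math.sqrt(population_size * carrier_frequency * (1 - carrier_frequency))
    return mean, max(0.0, mean - 2 * sd), mean + 2 * sd


def intervals_overlap(a, b) -> bool:
    """True when two (mean, low, high) intervals overlap, i.e. the two
    populations cannot be told apart at roughly the 95% level."""
    return a[1] <= b[2] and b[1] <= a[2]
```

Under this model, raising the sample size from 1000 to 10,000 multiplies the expected count by ten but the SD only by about 3.2, which is why small frequency differences that overlap at 1000 individuals can separate at 10,000.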
Fig. 4 is a graph that demonstrates that increasing numbers of European-American females (EAF) and European-American males (EAM) reach 100+ years of age in the US, by year (Census, 1900-1936; DHHS, 1937-1992).

Figs. 5A and 5B are graphs of intestinal cancer age- and birth year-specific mortality curves for EAM (Fig. 5A) and EAF (Fig. 5B) (Data recorded 1930-1992).

Figs. 6A and 6B are graphs of intestinal cancer age- and birth year-specific mortality curves for Non-European-American males (NEAM) (Fig. 6A) and Non-European-American females primarily of African-American descent (NEAF) (Fig. 6B) (Data recorded 1930-1992).

Figs. 7A and 7B are graphs which demonstrate age-specific relative survival rates of colon cancer for European-American females by year of diagnosis (Fig. 7A) and by year of birth (Fig. 7B).

Fig. 8 is a graph demonstrating the percentage of all deaths with vague 
diagnoses for European-American males of ages 50-54, 75-79, and 90-94 as a function 
of their year of birth. 

Figs. 9A and 9B are graphs of colon cancer age- and birth year-specific mortality curves adjusted for historical changes in underreporting and survival rates among European-American males (Fig. 9A) and European-American females (Fig. 9B). OBS*(h,t) = OBS(h,t) / [R(h,t) (1 - S(h,t))]

Figs. 10A and 10B are graphs of colon cancer age- and birth year-specific mortality curves adjusted for historical changes in underreporting and survival rates among Non-European-American males (Fig. 10A) and Non-European-American females (Fig. 10B). OBS*(h,t) = OBS(h,t) / [R(h,t) (1 - S(h,t))]

Figs. 11A and 11B are graphs of the mass of males (Fig. 11A) or females (Fig. 11B) as a function of age (Hamill, P.V., et al., Am. J. Clin. Nutr., 32(3), 607-629 (1979)).

Fig. 12 is a plot of the mutant fraction of the hprt locus as a function of age (n = 740); the slope of the line is 2.1 x 10''^ hprt mutations per cell year.

Fig. 13 is a graph demonstrating testicular cancer age- and birth year-specific mortality for EAM.

Fig. 14 is a Venn diagram representation of population at risk, F(h,t), as the intersection of the population at genetic primary risk (G) and the population at environmental primary risk (E).

Fig. 15 is a graph estimating parameters (F^, k^) and Aj, for the n=2 model in EAM 1870s.

Fig. 16 is a graph demonstrating the estimation of the integral of OBSR(h,t) = OBS(h,t) R(h,t). Open symbols represent extrapolations of the data used for the approximation.

Fig. 17 is a graph demonstrating the determination of (α-β) from the slope of log2 Δ(OBS*(h,t))/Δt. Data used are for EAM born in the 1920s, ages 17.5 to 57.5.

Fig. 18 is a schematic demonstrating a model for sporadic colon cancer for the case in which n=2 and m=1. Initiation is modeled as the loss of function of either APC allele followed by loss of heterozygosity of the second allele. Promotion is modeled as loss of heterozygosity for any of a set of second gatekeeper genes. Primary genetic risk in this example is defined as inherited heterozygosity for any second gatekeeper gene for colon cancer.

Figs. 19A and 19B are graphs showing the fraction of EAM, EAF, NEAM and NEAF at primary risk for colon cancer, F^, (Fig. 19A) and the initiation mutation rate, Tj (Fig. 19B) as a function of year of birth.

Figs. 20A and 20B are graphs demonstrating promotion mutation rate, r^, (Fig. 20A) and the adenomatous growth rate, (α-β), (Fig. 20B) for EAM, EAF, NEAM and NEAF as a function of year of birth.

Fig. 21 is a diagram of promotion for 'm' necessary events.

Figs. 22A and 22B are electropherograms demonstrating the sensitivity of the method of detecting inherited point mutations. Constant denaturing capillary electrophoresis (CDCE) display of background (Fig. 22A) and MNNG-induced mutational spectra (Fig. 22B) in the APC gene in human MT1 cells. Each numbered and lettered peak is a single mutant sequence which has been isolated and sequenced. Top panel: Mutants a - n are all G/C→T/A transversions arising from Pfu DNA polymerase error in the background. This constitutes the "noise" from PCR. Fig. 22B: Mutants 1-15 are nearly all G/C→A/T transitions (1, 10 and 12 are G/C→T/A transversions). Measured peak mutant fractions such as in peaks a-n are in the range of 0.5 to 2 x 10"^. Fractions of true mutants above this background are observed, isolated and sequenced as seen for MNNG-induced mutants 1-15.

Figs. 23A and 23B are electropherograms demonstrating the isolation of
pBR322 Hinf I restriction digest fragments using automated fraction collection. Fig.
23A shows the laser-induced fluorescence (LIF) detection trace of the separation of the
fragments in one of the capillaries of the array. The fractions were collected in time
intervals depicted by the vertical marks on the electropherogram. Fig. 23B shows the
LIF detection traces of the fractions recovered from the collection gel well plate used in
(Fig. 23A) and reinjected. The off scale peak corresponds to fluorescein, used as an
internal standard. The collected fractions were membrane-desalted prior to reinjection.

Fig. 24 is a series of electropherograms demonstrating CDCE LIF output for
runs of the four pooled samples used in Table 7. A sequence of the APC exon 15 is
shown as an example. On the left one sees the PCR primers and the internal standard
peak followed by the large wild type peak and then the peak of the inherited point
mutant. The mutation is a G→T transversion at bp 8634. In this form of sample
presentation all sequences are in the homoduplex form so a mutant is present as a single
peak of different melting temperature than the wild type or internal standard. This
demonstrates the reproducibility of the proposed assay for inherited point mutants and
indicates the variation which may be expected among large samples drawn from the
same community of mixed ethnicity (Boston).

Fig. 25 is a series of electropherograms demonstrating the results of
CDCE/hifiPCR analysis of pooled samples of juveniles from two ethnic backgrounds.
This APC gene exon 15 G→T transversion at bp 8634 is found at a significantly higher
frequency in the African-American than in the Hispanic-American group.

Figures 26A-26C are electropherograms obtained in a study of blood samples
from 446 African-American juveniles which were pooled and analyzed for inherited
point mutations in exon 6 and 9-bp of its adjacent splice sites. PCR background studies
(Fig. 26A) indicated a single mutant allele would be detected at fractions of 5 x 10"' or
higher. The preliminary scan in which true mutant/wild type heteroduplexes would be
mixed with non-mutant noise peaks is shown in the second panel with a quantitative
internal control peak (Fig. 26B). Isolation of the raw heteroduplex sample followed by
high-fidelity PCR produced a clear electropherogram containing one and only one pair
of peaks which could have arisen as a mutant in the original pooled sample (Fig. 26C).
Subsequent isolation and sequencing showed this mutant to be a "wobble" transition
GTC GCA → GTT GCA (SEQ ID NO:1 → SEQ ID NO:2) (Val→Val) which had not been
previously reported. It was present in approximately 6 x 10^-3 of the original 669 HPRT
alleles or approximately 4 mutant copies. It was not present in the Hispanic-American
group (New York City) or the mixed ethnic sample of 2000 juveniles (Boston).

Fig. 27 is a fine structure map of inherited point mutations in the three exons of
the human beta globin gene. About 10,000 alleles of the Han Chinese population were
screened.

Fig. 28 is a schematic demonstrating a multicapillary CDCE instrument with
fraction collector.

Fig. 29 is a melting map of the human uracil-DNA glycosylase (UNG) gene 
demonstrating the selection of target sequences that cover three exons. The melting 
profile is shown only for the first 7000 base pairs (bp) of the genomic UNG sequence.

Fig. 30 illustrates the design of a synthetic DNA/RNA chimera (SEQ ID NO:3)
to convert codon 12 of rad54 from AAG lysine to TAG stop. Upper case letters
represent DNA nucleotides and lower case 2'-O-methyl RNA nucleotides.

DETAILED DESCRIPTION OF THE INVENTION

The invention relates to a method for identifying inherited point mutations in a
targeted region of the genome in a large population of individuals and determining
which inherited point mutations are deleterious, harmful or beneficial. Deleterious
mutations are identified directly by a method of recognition using the set of point
mutations observed in a large population of juveniles. Harmful mutations are identified
by comparison of the set of point mutations observed in a large set of juveniles and a
large set of aged individuals of the same population. Beneficial mutations are similarly
identified.

As used herein, a gene defines a DNA sequence encoding proteins, RNAs or
other physiologically functional structures which carry deleterious, harmful or
beneficial mutations identified by the presence of point mutations within said
deleterious, harmful or beneficial mutations.

The relationship between harmful point mutations and a particular disease (e.g., a
mortal disease) can be discovered by a method of comparison of a theoretical age
dependent decline of harmful alleles for all particular diseases, and the observed age
dependent decline in harmful alleles of a particular gene.

In one aspect, the invention is a method for isolating or identifying inherited
point mutations in a target region of a genome. Point mutations are mutations where a
small number (e.g., 1-25 or 1-10 or 1-5 or a single base pair) of base pairs are deleted,
added or substituted for by a different base (e.g., transitions, transversions). Inherited
point mutations are those that are present in the genome of an individual at conception.
Generally, the sequence of the target region is known from public databases such as
GenBank. Suitable target sequences can be up to about 3,000 base pairs (bp) long.
Target sequences can be from about 80 to about 1000 bp or about 100 to about 500 bp.
In particular, target sequences are about 100 bp. In one embodiment the target sequence
is an isomelting domain. Suitable target regions can be found throughout the genome.
In fact, it is desirable to identify all point mutations in the entire genome using methods
described herein. In certain embodiments, the target region is a protein encoding gene
or a portion thereof, or an RNA encoding gene or a portion thereof. For example, the
target region can be an exon of a protein encoding gene, a regulatory region (e.g.,
promoter), an intron, or a portion of any of the foregoing. The target region can also
span the junction of an exon and an intron of a gene. RNA encoding genes are genes
which encode RNA molecules that are not translated into proteins, such as genes
encoding ribosomal RNAs, transfer RNAs and ribozymes.

The method for isolating or identifying inherited point mutations in a target
region of a genome includes the steps of providing a pool of DNA fragments isolated
from a population, and amplifying the target region.

As used herein, a population refers to the set of individuals (e.g., mammals,
humans, Homo sapiens) from which DNA samples are taken. In certain embodiments, a
population is made up of individuals having common characteristics, such as gender,
race, ethnicity, age, disease or disorder and the like. For example, a population of
humans can be a sample representative of all human inhabitants of Earth, or can include
only individuals from the same demographic group, such as individuals of the same race
and/or ethnicity (e.g., individuals of European ancestry, African ancestry, Asian
ancestry, Indian ancestry, Hispanic ancestry). It is appreciated that a population can be
made up of individuals which are a subgroup of a larger population. For example, the
population of Han Chinese is included in the population of individuals of Asian
ancestry.

Proband populations are made up of individuals having a particular disease or
disorder. Early onset proband populations are made up of the youngest 5 or 10% of
individuals which develop disease, and late onset proband populations are made up of
the oldest 5 or 10% of individuals which develop disease. As used herein, "young
individuals" includes individuals which are considered to be juveniles, and "aged
individuals" includes the oldest about 5% of individuals. For example, a young human
population includes individuals of 18 years of age or less, preferably of about 6 years of
age or less. An aged human population includes individuals of at least 90, preferably at
least 98 years of age. All individuals that are not young or aged, as described herein,
are considered to be of intermediate age. Age-specific intermediate populations are
made up of individuals of a desired age, such as humans having ages which fall within a
desired interval (e.g., a five or ten year interval).

The number of individuals included in the population can vary. Generally, the
population is made up of at least 1000 individuals. The method of the invention is well
suited for detecting point mutations which occur at a low frequency (e.g., about 5 x 10'^)
and thus permits the use of a large population. For example, the population can include
at least about 1000 or at least about 10,000 individuals. In particular embodiments, the
population can be made up of between about 10,000 and about 1,000,000 individuals.
In a preferred embodiment, the population is made up of about 100,000 individuals.
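
The link between population size and the number of mutant allele copies available in a pooled sample can be illustrated with a short calculation. The following is a minimal sketch only: the function name, the assumption of two allele copies per individual (a diploid autosomal locus), and the allele fractions used are illustrative assumptions and are not values taken from this description.

```python
def expected_mutant_copies(individuals, allele_fraction, copies_per_individual=2):
    """Expected number of mutant allele copies in a pooled DNA sample.

    copies_per_individual=2 assumes a diploid autosomal locus (an assumption);
    an X-linked locus in a mixed-sex sample would use a smaller average value.
    """
    total_alleles = individuals * copies_per_individual
    return total_alleles * allele_fraction


# Hypothetical allele fractions, chosen only to show how the count scales
# with the population sizes named in the text.
for n in (1_000, 10_000, 100_000, 1_000_000):
    for fraction in (1e-3, 1e-4):
        copies = expected_mutant_copies(n, fraction)
        print(f"{n:>9} individuals, fraction {fraction:g}: ~{copies:g} mutant copies")
```

Under these assumptions, a rarer allele is represented by only a handful of copies unless the pooled population is large, which is consistent with the preference for large populations stated above.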

The DNA (nuclear, mitochondrial, pooled) can be isolated from each individual
in the population and fragmented using methods well known to those of skill in the art.
Generally, a sample (e.g., any DNA-containing biological sample such as a tissue
biopsy, whole blood, isolated cells) is acquired and DNA is isolated from the cells
contained in the sample. DNA can be isolated from a sample from an individual or
from pooled samples. For example, DNA can be obtained by acquiring a sample of
white blood cells or other suitable tissue sample from each individual of the population.
Samples containing similar numbers of cells can be pooled and DNA can be isolated
therefrom. Several samples of DNA that were isolated from individuals can also be
pooled. Many suitable methods for isolating DNA from cells and/or tissues are
available. The isolated DNA is generally fragmented by digestion with one or more
suitable restriction endonucleases. Any restriction endonuclease which does not cleave
within the target region of the DNA can be used. Preferably, a restriction endonuclease
which cuts DNA with low frequency is selected, such as an enzyme with a 6 base pair
recognition site ("six-cutter"). Six-cutter enzymes are less likely to cut a target
sequence than other enzymes which cut DNA with higher frequency (e.g., four-cutters)
and convert genomic DNA into a pool of fragments averaging about 4000 bp. DNA can
be individually digested, and the resulting fragments pooled, or a pooled sample of
DNA can be digested with a suitable restriction endonuclease to produce a pool of
fragments.
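
The average fragment size of about 4000 bp for a six-cutter follows from a simple probability argument, sketched below. The sketch assumes a random sequence in which each of the four bases occurs with equal probability and treats possible site positions as independent; both are simplifying assumptions, and the function names are illustrative only.

```python
def expected_fragment_length(site_length, base_probability=0.25):
    """Mean spacing (bp) between recognition sites in a random sequence.

    Assumes equal, independent base frequencies (an approximation for real genomes).
    """
    return 1.0 / (base_probability ** site_length)


def probability_target_uncut(target_length, site_length, base_probability=0.25):
    """Approximate chance that a target of the given length contains no site.

    Treats each possible site start position as independent (an approximation).
    """
    positions = max(target_length - site_length + 1, 0)
    site_probability = base_probability ** site_length
    return (1.0 - site_probability) ** positions


print(expected_fragment_length(6))  # six-cutter: 4**6 bp between sites on average
print(expected_fragment_length(4))  # four-cutter: 4**4 bp between sites on average
print(probability_target_uncut(100, 6), probability_target_uncut(100, 4))
```

With these assumptions a six-cutter site is expected once every 4^6 = 4096 bp, in line with the figure of about 4000 bp, and a 100 bp target is more likely to remain intact with a six-cutter than with a four-cutter.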

The pool of DNA fragments provides a suitable template for amplifying the
target region of the genome. The target sequences can be amplified without further
processing of the pool, or the pool can be processed to enrich for fragments which
contain the target region. Preferably, the pool is enriched for fragments which contain
the target region. Enrichment can be achieved using suitable methods. For example, as
described herein, the DNA fragments can be incubated with a labeled oligonucleotide
probe (e.g., biotin-labeled) that can hybridize to the target region or to sequences
flanking the target region under conditions suitable for hybridization, and the labeled
hybrids which form can be isolated (e.g., using streptavidin-coated beads). Enrichments
of about 10,000 fold have been achieved using this method.

The target regions of the pool of DNA fragments are amplified in a high fidelity
polymerase chain reaction (hifiPCR) under conditions suitable to produce double
stranded DNA products which contain a terminal high temperature isomelting domain
that is labeled with a detectable label. "HifiPCR" is a polymerase chain reaction
performed under conditions where PCR-induced (e.g., polymerase induced) mutations
are minimized (see, for example, U.S. Patent No. 5,976,842, the entire teachings of
which are incorporated herein by reference). As used herein, hifiPCR refers to a
polymerase chain reaction wherein the mutant fraction of each PCR-induced mutation is
not greater than about 10"*. Preferably, the mutant fraction of each PCR-induced
mutation is not greater than about 5 x 10-^. HifiPCR of target regions can be carried out
using Pfu(TM) polymerase where the amplification is limited to about 6 doublings. As
described herein, the frequency of PCR-induced mutation at any base is dependent upon
the number of PCR doublings and the error rate per base pair per doubling. Thus,
suitable conditions for hifiPCR can be determined for any desired polymerase and target
sequence.
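
The dependence of PCR-induced mutant fraction on the number of doublings and the per-doubling error rate can be sketched as follows. This is an illustration under a stated assumption: the mutant fraction at a given base is taken to be the simple product of the error rate per base pair per doubling and the number of doublings, which is a first-order approximation and not a formula quoted from this description. The error rates shown are hypothetical placeholders and are not measured values for any polymerase.

```python
def pcr_mutant_fraction(error_rate, doublings):
    """First-order estimate of PCR-induced mutant fraction at one base pair.

    Assumes the fraction accumulates linearly: error_rate * doublings.
    error_rate is per base pair per doubling (hypothetical values below).
    """
    if error_rate < 0 or doublings < 0:
        raise ValueError("error_rate and doublings must be non-negative")
    return error_rate * doublings


# Hypothetical error rates, for illustration only.
for error_rate in (1e-6, 1e-5):
    for doublings in (6, 20, 30):
        fraction = pcr_mutant_fraction(error_rate, doublings)
        print(f"error rate {error_rate:g}, {doublings} doublings: ~{fraction:.1e}")
```

Under this approximation, limiting the number of doublings (for example to about 6, as stated above) or choosing a polymerase with a lower error rate reduces the PCR-induced background in proportion.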

The high temperature isomelting domain is provided by a detectably labeled
primer which includes a 5' 40 base non-monotonous sequence with a high melting
temperature (e.g., a G/C rich sequence) and a 20 base target region specific sequence.
Suitable detectable labels include, for example, a radioisotope, an affinity label (e.g.,
biotin, avidin), a spin label, a fluorescent group (e.g., fluorescein) or a
chemiluminescent group. In a preferred embodiment, the primer is labeled at the 5' end
with fluorescein.

The products of the hifiPCR which contain wild type or mutant (e.g., having a
point mutation) target regions can be separated based upon differences in melting
temperature of the wild type and mutant target regions. The products which contain
mutant target regions can be recovered, thereby producing a secondary pool of DNA
that is enriched in target regions that contain point mutations. The PCR products which
contain wild type or mutant target regions can be separated without further processing.
However, it is preferred that the PCR products are processed to form a mixture of wild
type homoduplexes and heteroduplexes which contain point mutations before
separation. Such a mixture can readily be prepared by, for example, heating the
products of the hifiPCR, thereby melting the products to form single stranded DNAs.
The mixture can then be cooled to allow the single stranded DNAs to anneal. Because
the quantity of DNA strands with a wild type target region greatly exceeds the quantity
of DNA strands with a mutant target region, essentially all DNA strands with a mutant
target region form heteroduplexes with wild type strands.
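
The statement that essentially all mutant strands end up in heteroduplexes can be made quantitative with a small sketch. It assumes that, after melting and cooling, strands reanneal at random with complementary strands, so that a mutant strand finds a wild type partner with probability equal to the wild type fraction. Random reannealing and the mutant fractions shown are illustrative assumptions.

```python
def mutant_strand_pairing(mutant_fraction):
    """Return (P(heteroduplex), P(mutant homoduplex)) for one mutant strand.

    Assumes random reannealing: the partner strand is mutant with probability
    mutant_fraction and wild type otherwise.
    """
    if not 0.0 <= mutant_fraction <= 1.0:
        raise ValueError("mutant_fraction must lie between 0 and 1")
    return 1.0 - mutant_fraction, mutant_fraction


# Hypothetical mutant fractions in the amplified pool.
for fraction in (1e-2, 1e-3, 1e-4):
    hetero, homo = mutant_strand_pairing(fraction)
    print(f"mutant fraction {fraction:g}: heteroduplex {hetero:.4f}, homoduplex {homo:.4f}")
```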

A number of methods which are suitable for separating nucleic acids based
upon differential melting temperatures have been described, including constant
denaturing gel capillary electrophoresis (U.S. Patent No. 5,633,129, the entire teachings
of which are incorporated herein by reference), constant denaturing gel electrophoresis,
denaturing gradient gel electrophoresis or denaturing high performance liquid
chromatography (Gross E., et al., Hum. Genet., 105:72-78 (1999)). Preferably, the PCR
products which contain wild type or mutated target regions, or the mixture of wild type
homoduplexes and heteroduplexes which contain point mutations prepared therefrom,
are separated by constant denaturing gel capillary electrophoresis and the DNAs
containing mutated target regions are recovered to produce the secondary pool of DNA.

The secondary pool can be amplified in a high fidelity PCR under conditions
where only homoduplexed DNA is produced, thereby producing a mixture of
homoduplexed DNA containing wild type target region and of homoduplexed DNAs
which contain target regions that include point mutations. The homoduplexed DNAs
which contain target regions that include point mutations can be resolved based upon
the differential melting temperatures of the DNAs as described and the resolved DNAs
can be recovered. Occasionally, a mutant homoduplex can have a melting temperature
which is nearly identical to the melting temperature of the wild type homoduplex. Such
mutant homoduplexes can be detected by recovering the wild type homoduplex fraction,
and heating and cooling the fraction to create heteroduplexes. The resulting
heteroduplexes can be resolved and recovered. Preferably, the homoduplexed or
heteroduplexed DNAs which contain target regions that include point mutations are
resolved by constant denaturing gel capillary electrophoresis and the resolved DNAs are
recovered. The recovered DNAs can then be sequenced using any suitable sequencing
method (e.g., cycle sequencing) to identify the point mutations. Once identified, the
frequency with which each inherited point mutation occurs in the population can be
determined.

Inherited point mutations identified as described herein can be used to create a
quantitative fine structure map of the point mutations in a gene. Such a quantitative
map can be created by determining the frequency with which each identified point
mutation occurs in the population. The method of the invention is particularly suited for
this purpose because rare inherited point mutations can be detected in DNA pools
isolated from large populations (e.g., 100,000 or even 1,000,000 individuals). It is well
appreciated that the accuracy and utility of such a map is greatly enhanced if it reflects
the frequency of mutations in a large population. A fine structure map of a gene can be
used to determine if the gene carries harmful, deleterious or beneficial (e.g., longevity
promoting) alleles.

In another aspect the invention relates to a method for identifying genes which
carry a harmful allele. As described herein, harmful alleles are alleles which shorten life
span. Harmful alleles include, for example, alleles which are causal for disease (e.g.,
cancer, atherosclerosis) or which accelerate the onset of a disease, such as point
mutations which increase somatic mutation rates. Harmful alleles are expected to be
present at a reduced frequency in a population of aged individuals in comparison to a
population of young (juvenile) individuals. Thus, genes which carry harmful alleles can
be identified by comparing the sum frequency or the individual frequencies of inherited
point mutations found in the gene of an aged population to the sum frequency or the
individual frequencies of the same inherited point mutations in a young population. The
method comprises identifying the inherited point mutations which are found in the
genes or portions thereof of a population of young individuals and in the genes or
portions thereof of a population of aged individuals. The frequency with which each
point mutation occurs in each of the populations is determined. In one embodiment, the
sum of the frequencies (sum frequency) of all inherited point mutations in a selected
gene, or fragment thereof, of the young population is calculated. The sum frequency of
all inherited point mutations in the selected gene, or fragment thereof, of the aged
population is also calculated, and the sums are compared. A significant reduction in the
sum frequency of inherited point mutations in a gene of the aged population relative to
the young population indicates that the selected gene carries a harmful allele. In another
embodiment, the frequency of each point mutation identified in a selected gene or
portion thereof of the young population is compared with the frequency of the same
point mutation identified in the selected gene of the aged population. A significant
reduction in the frequencies of two or more inherited point mutations in a gene of the
aged population indicates with a high probability that the gene carries a harmful allele.
A significant reduction in frequency of only one inherited point mutation in a gene of
the aged population indicates that the gene either carries a harmful allele or is
genetically linked to a nearby harmful allele. Where no significant differences in the
sum frequency or the individual frequencies of the inherited point mutations are
observed, the gene does not carry harmful alleles.

As used herein, "significant" means statistically significant. Statistical
significance can be determined using a suitable statistical test, such as Chi square test or
multinomial distribution, modified to account for the fact that a large number of alleles
are being compared. For example, the statistical test can be modified by application of
the Bonferroni inequality.
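
One way such a comparison might be carried out is sketched below, using a Pearson chi-square test on a 2 x 2 table of mutant and non-mutant allele counts in the young and aged populations, with the significance threshold divided by the number of comparisons (Bonferroni). The allele counts, the number of comparisons, the 0.05 threshold and the function names are all hypothetical and serve only to illustrate the calculation.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def significant_after_bonferroni(p, n_tests, alpha=0.05):
    """Apply the Bonferroni inequality: compare p with alpha / n_tests."""
    return p < alpha / n_tests


# Hypothetical counts: mutant and wild type alleles in young and aged pools.
young_mutant, young_total = 200, 200_000
aged_mutant, aged_total = 120, 200_000
chi2 = chi_square_2x2(young_mutant, young_total - young_mutant,
                      aged_mutant, aged_total - aged_mutant)
p = p_value_1df(chi2)
print(f"chi-square = {chi2:.2f}, p = {p:.2e}")
print("significant across 100 comparisons:", significant_after_bonferroni(p, 100))
```

The same function can be applied to the sum frequency of all point mutations in a gene or to each point mutation individually; in the latter case `n_tests` would be the number of point mutations compared.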

Genes in which the frequency of one or more inherited point mutations declines
in an aged population in comparison to a young population can be further studied. For
example, the age-specific decline of the one or more inherited point mutations can be
determined. Age-specific decline can be assessed by determining the frequency of the
one or more inherited point mutations in one or more age-specific intermediate
populations. The frequencies of the one or more inherited point mutations in the
age-specific intermediate population(s) together with the frequencies in the young and aged
populations demonstrate the age-specific decline of the one or more point mutations.
The determined age-specific decline of the one or more point mutations is compared to
the theoretical age-specific decline of harmful alleles which cause mortal diseases (each
mortal disease has a theoretical rate of decline). The theoretical age-specific decline of
harmful alleles which cause mortal disease, X(h,t) (equation 4), is calculated as
described herein (Example 1). If the determined age-specific rate of decline of the one
or more point mutations is similar or essentially the same as (i.e., is not statistically
different from) the theoretical age-specific decline of harmful alleles which cause a
particular mortal disease, then the gene which carries the one or more inherited point
mutations has a high probability of being causal for the particular disease.

A gene in which one or more point mutations undergo age-specific decline and
which has a high probability of being causal for the particular disease can be further
studied in a suitable proband population. Generally, the proband population consists of
individuals of various ages who have the particular disease. The frequency of the one or
more point mutations in the proband group is determined and compared with the
frequency of the same one or more point mutations in the young population. A
significant increase in the frequency of the one or more point mutations in the proband
population in comparison to the young population indicates that the gene carries a
harmful allele that plays a causal role in said disease.

There are genes that carry harmful alleles which do not cause disease, but which
are secondary risk factors that accelerate the appearance of disease. Such genes can
contain one or more inherited point mutations which undergo age-specific decline, and
which hasten the appearance of the particular disease. However, a significant
association with the proband population, which consists of individuals of a variety of
ages, may not be observed for these mutations. In such a situation, the frequency of the
one or more inherited point mutations can be determined in an early onset proband
population and compared with the frequency of the same one or more inherited point
mutations in the young population. A significant increase in the frequency of the one or
more point mutations in the early onset proband population in comparison to the young
population indicates that the gene carries a harmful allele that is a secondary risk factor
which accelerates the appearance of the disease.

Genes carrying harmful alleles that are secondary risk factors which accelerate
the appearance of disease can also be identified by determining the frequency of each
inherited point mutation which occurs in the genes of a population with early onset of a
particular disease (early onset proband) and which occurs in the genes of a population
with late onset of the same disease (late onset proband). The frequencies of each
inherited mutation for a selected gene of the early onset proband are compared with the
frequencies of the same point mutations in the gene of the late onset proband population. A
significant increase in the frequency of one or more inherited point mutations in the
early onset proband relative to the late onset proband indicates that the gene carries a
harmful allele that is a secondary risk factor which accelerates the appearance of the
disease.

The invention also relates to a method for identifying genes which carry an
allele which increases longevity. The method comprises identifying the inherited point
mutations which are found in the genes or portions thereof of a population of young
individuals and in the same gene or portion thereof in a population of aged individuals.
The frequency with which each point mutation occurs in a selected gene is determined
for each population. In one embodiment, the sum of the mutation frequencies for the
selected gene in the young population is compared to the sum of the mutation
frequencies for the selected gene in the aged population. A significant increase in the
sum of the mutation frequencies in the aged population relative to the young population
indicates that the selected gene carries alleles which increase longevity. In another
embodiment, the frequency of each point mutation identified in a selected gene in the
young population is compared to the frequency of the same point mutation identified in
the selected gene in the aged population. A significant increase in the frequency of two
or more point mutations in a gene of the aged population relative to the young
population indicates that the selected gene carries an allele which increases longevity.
In another embodiment the frequency of each point mutation identified in a selected
gene in the young population is compared to the frequency of the same point mutation
identified in the selected gene in the aged population. A significant increase in the
frequency of a point mutation in the gene of the aged population relative to the young
population indicates that the selected gene carries an allele which increases longevity or
which is linked to a gene that increases longevity.

The invention also relates to a method for identifying genes which affect the
incidence of a disease. The method comprises identifying the inherited point mutations
which are found in genes or portions thereof of a population of young individuals not
afflicted with the disease, and in the genes or portions thereof of a proband population
having the disease. The frequencies with which each point mutation occurs in a selected
gene or segment thereof are determined and summed for each population. The sum
frequency of point mutations in a selected gene or portion thereof in the young
population is compared to the sum frequency of point mutations in the selected gene or
portion thereof in the proband population. A significant increase in the sum frequency
of point mutations in the proband population relative to the young population indicates
that the gene plays a causal role in said disease.

In another aspect the invention relates to a method for identifying a gene which
carries a deleterious allele. Deleterious alleles are alleles which interfere with
reproduction. The genes are identified based upon the frequency of obligatory knockout
and presumptive knockout alleles in a population. Obligatory knockouts are point
mutations which necessarily inactivate the gene, such as point mutations which introduce
stop codons or frame shifts in exons of protein encoding genes. Obligatory knockouts
can also occur in splice sites in the exons or introns of protein encoding genes.
Presumptive knockouts are point mutations which are expected to inactivate the gene.
For example, a point mutation that introduces a cysteine residue into a protein can form
an inappropriate disulfide bond and alter the folding or cause aggregation of the protein.
Accordingly, inherited point mutations which introduce a cysteine residue into a protein
are considered presumptive knockouts.
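
The inspection of sequences for knockout mutations can be partly illustrated for single-codon substitutions. The sketch below assumes the standard genetic code and covers only two of the cases named above: substitutions that create a stop codon (treated as obligatory knockouts) and substitutions that create a cysteine codon (treated as presumptive knockouts). Frame shifts and splice site mutations are not handled, and the function name and return labels are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code, DNA alphabet
CYSTEINE_CODONS = {"TGT", "TGC"}      # standard genetic code, DNA alphabet


def classify_codon_change(wild_type, mutant):
    """Classify a codon substitution by the two rules sketched in the lead-in."""
    wild_type, mutant = wild_type.upper(), mutant.upper()
    for codon in (wild_type, mutant):
        if len(codon) != 3 or set(codon) - set("ACGT"):
            raise ValueError(f"not a DNA codon: {codon!r}")
    if mutant in STOP_CODONS and wild_type not in STOP_CODONS:
        return "obligatory knockout (new stop codon)"
    if mutant in CYSTEINE_CODONS and wild_type not in CYSTEINE_CODONS:
        return "presumptive knockout (new cysteine)"
    return "not classified as a knockout by these rules"


print(classify_codon_change("AAG", "TAG"))  # lysine to stop
print(classify_codon_change("CGT", "TGT"))  # arginine to cysteine
print(classify_codon_change("GTC", "GTT"))  # valine to valine (silent)
```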

In one embodiment, the method for identifying a gene which carries a
deleterious allele comprises identifying the inherited point mutations occurring in the
exon(s) and splice sites of a gene of a population of young individuals. The inherited
point mutations that are obligatory knockouts are identified by inspection of the
sequences, and the frequency with which each obligatory knockout point mutation
occurs is determined. The frequencies of all obligatory knockout point mutations
identified in the gene are summed. A sum frequency of all obligatory knockout point
mutations of less than about 2% indicates that said gene carries a deleterious allele. In
another embodiment, a sum frequency of all obligatory knockout point mutations of
about 0.02% to about 2% indicates that said gene carries a recessive deleterious allele.
In another embodiment, a sum frequency of all obligatory knockout point mutations of
less than about 0.02% indicates that said gene carries a dominant deleterious allele.
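
The thresholds stated above can be expressed as a simple decision rule, sketched below. Because the thresholds are described as "about" 0.02% and "about" 2%, the exact handling of values at the boundaries is an assumption of this sketch, as are the function name and the example frequencies.

```python
def classify_sum_knockout_frequency(sum_frequency):
    """Interpret the summed frequency of knockout point mutations in a gene.

    sum_frequency is a fraction of alleles (0.02 means 2%). Boundary handling
    at 0.0002 and 0.02 is an assumption of this sketch.
    """
    if not 0.0 <= sum_frequency <= 1.0:
        raise ValueError("sum_frequency must lie between 0 and 1")
    if sum_frequency < 0.0002:          # less than about 0.02%
        return "dominant deleterious allele indicated"
    if sum_frequency < 0.02:            # about 0.02% to about 2%
        return "recessive deleterious allele indicated"
    return "no deleterious allele indicated by this criterion"


# Hypothetical per-mutation frequencies for the knockouts found in one gene.
knockout_frequencies = [0.0005, 0.0012, 0.0003]
total = sum(knockout_frequencies)
print(f"sum frequency = {total:.4f}: {classify_sum_knockout_frequency(total)}")
```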

In another embodiment, the method for identifying a gene which carries a
deleterious allele comprises identifying the inherited point mutations occurring in the
exon(s) and splice sites of a gene of a population of young individuals. The inherited
point mutations that are obligatory knockouts and the inherited point mutations that are
presumptive knockouts are identified by inspection of the sequences, and the
frequency with which each obligatory or presumptive knockout point mutation occurs
is determined. The frequencies of all knockout point mutations identified in the gene
(obligatory and presumptive) are summed. A sum frequency of all knockout point
mutations of less than about 2% indicates that said gene carries a deleterious allele. In
another embodiment, a sum frequency of all obligatory knockout point mutations of
about 0.02% to about 2% indicates that said gene carries a recessive deleterious allele.
In another embodiment, a sum frequency of all obligatory knockout point mutations of
less than about 0.02% indicates that said gene carries a dominant deleterious allele.

In another aspect, the invention relates to isolated nucleic acids which are
complementary to a strand of a gene or portion thereof, or an allele or portion thereof
identified according to the methods described herein. The invention also relates to
isolated target regions of a genome which contain inherited point mutations which are
isolated according to the methods described herein.

Nucleic acids referred to herein as "isolated" are nucleic acids separated away
from the nucleic acids of the genomic DNA or cellular RNA of their source of origin
(e.g., as it exists in cells or in a mixture of nucleic acids such as a library), and may have
undergone further processing. "Isolated" nucleic acids include nucleic acids obtained
by methods described herein, similar methods or other suitable methods, including
essentially pure nucleic acids, nucleic acids produced by chemical synthesis
(oligonucleotides), by combinations of biological and chemical methods, and
recombinant nucleic acids which are isolated.

Additionally, the nucleic acid molecules of the invention can be modified at the
base moiety, sugar moiety or phosphate backbone to improve, e.g., the stability,
hybridization, or solubility of the molecule. For example, the deoxyribose phosphate
backbone of the nucleic acids can be modified to generate peptide nucleic acids (see
Hyrup et al. (1996) Bioorganic & Medicinal Chemistry, 4:5). As used herein, the terms
"peptide nucleic acids" or "PNAs" refer to nucleic acid mimics, e.g., DNA mimics, in
which the deoxyribose phosphate backbone is replaced by a pseudopeptide backbone
and only the four natural nucleobases are retained. The neutral backbone of PNAs has
been shown to allow for specific hybridization to DNA and RNA under conditions of
low ionic strength. The synthesis of PNA oligomers can be performed using standard
solid phase peptide synthesis protocols as described in Hyrup et al. (1996), supra;
Perry-O'Keefe et al. (1996) Proc. Natl. Acad. Sci. USA, 93:14670. PNAs can be further
modified, e.g., to enhance their stability, specificity or cellular uptake, by attaching
lipophilic or other helper groups to PNA, by the formation of PNA-DNA chimeras, or
by the use of liposomes or other techniques of drug delivery known in the art. The
synthesis of PNA-DNA chimeras can be performed as described in Hyrup (1996),
supra, Finn et al. (1996) Nucleic Acids Res., 24(17):3357-63, Mag et al. (1989) Nucleic
Acids Res., 17:5973, and Petersen et al. (1995) Bioorganic Med. Chem. Lett., 5:1119.
Other DNA mimics can comprise an array of bases which are immobilized on a solid
support, such as glass or silicon. The bases are arranged in a manner which permits
hybridization with a complementary nucleic acid.

The nucleic acid molecules and fragments of the invention can also include other
appended groups such as peptides (e.g., for targeting host cell receptors in vivo), or
agents facilitating transport across the cell membrane (see, e.g., Letsinger et al. (1989)
Proc. Natl. Acad. Sci. USA, 86:6553-6556; Lemaitre et al. (1987) Proc. Natl. Acad. Sci.
USA, 84:648-652; PCT Publication No. WO88/09810) or the blood brain barrier (see,
e.g., PCT Publication No. WO89/10134). In addition, oligonucleotides can be modified
with hybridization-triggered cleavage agents (see, e.g., Krol et al. (1988)
Bio-Techniques, 6:958-976) or intercalating agents (see, e.g., Zon (1988) Pharm. Res.,
5:539-549). The nucleic acids of the invention can also be modified by a detectable
label, such as a radioisotope, spin label, antigen or enzyme label, fluorescent or
chemiluminescent group and the like.
The nucleic acids of the invention can be used as probes, for example, to detect
the presence of a deleterious allele. Probes can be of a suitable length to ensure
specificity in a hybridization reaction. For example, the probes can be about 15 to
several thousand nucleotides long. Preferred probes are synthetic molecules (e.g.,
oligonucleotides) which are about 15 to about 20 or 25 nucleotides long.

The invention also relates to an array of isolated nucleic acids of the invention,
immobilized on a solid support, said array having at least about 100 different isolated
nucleic acids which occupy separate known sites in said array, wherein each of said
different isolated nucleic acids can specifically hybridize to a target region, gene or
allele which contains an inherited point mutation. Such arrays can be prepared using
suitable methods, such as the method described in U.S. Patent No. 5,837,832, the entire
teachings of which are incorporated herein by reference.

The arrays or DNA chips can be used as probes to detect harmful or deleterious
alleles. For example, alleles which play a causal role in mortal disease (e.g., cancer) can
be detected in individuals asymptomatic of the disease. An individual diagnosed in this
manner could then begin appropriate prophylactic therapy. The arrays can be used to
tailor chemotherapeutic intervention for an individual. For example, an array of probes
which hybridize to alleles of xenometabolic enzymes (e.g., cytochrome P450s) can be
used to determine an individual's ability to metabolize certain types of drugs.
Accordingly, a drug which provides superior efficacy and minimal side effects can be
selected for therapy. The arrays can also be used to provide genetic counseling. For
example, an array of probes which hybridize to recessive deleterious alleles can be used
to detect such alleles in a man and in a woman hoping to have a child. If both the man
and the woman carry a recessive deleterious allele of the same gene, then 0.25 of the
fertilized eggs produced by them would be homozygous for the allele and may not be
viable. The same array can be used to select fertilized eggs which are not homozygous
for a deleterious allele. For example, eggs fertilized in vitro can be allowed to undergo
cleavage, and a single cell can be removed from the egg for genetic analysis. Eggs
which are not homozygous for a deleterious allele can be selected and implanted into a
woman's uterus.
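
The 0.25 figure quoted above is the standard Mendelian expectation for two heterozygous carriers. A minimal sketch that enumerates the four equally likely allele combinations is given below; the allele labels "A" (normal) and "a" (recessive deleterious) and the function name are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def homozygous_mutant_fraction(parent1=("A", "a"), parent2=("A", "a")):
    """Fraction of fertilized eggs receiving the 'a' allele from both parents.

    Each parent passes on one of its two alleles with equal probability,
    so every (allele1, allele2) pair below is equally likely.
    """
    offspring = list(product(parent1, parent2))
    hits = sum(1 for genotype in offspring if genotype == ("a", "a"))
    return Fraction(hits, len(offspring))


print(homozygous_mutant_fraction())  # 1/4
```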

The methods described herein can be used to identify any and all deleterious,
harmful and beneficial point mutations carried by a particular population. Furthermore,
a DNA chip or small set of DNA chips for detecting all deleterious, harmful and
beneficial point mutations in all human populations and subgroups found in North
America, Europe, Asia, Africa and South America can be prepared as a result of the
invention.

EXEMPLIFICATION
EXAMPLE 1
Introduction 

Many inherited mutations are known to cause or predispose people to disease.
As the human genome is mapped and sequenced, the number of markers that allow for
genome-wide scans has increased dramatically, allowing many regions linked to disease
to be determined or mapped by exclusion (Nelen et al., 1996; Comuzzie et al., 1997;
Marshall, 1997). To date, some 1339 genes associated with human disease have been
chromosomally mapped (Genome Database, Johns Hopkins University). However,
when any long sequences are compared between two randomly chosen alleles in the
human population, approximately two variations are found for every 1000 bp (Rowen et
al., 1997). Thus in a large human population any 1000 bp would be expected to display
many sequences differing from the canonical sequence. Some of these sequence
variants may affect physiological functions such as tumor suppression, xenometabolism,
DNA repair, or control of cell death and division.

It seems logical that if a subpopulation, by virtue of an inherited mutation, were
at risk of a form of death for which the rest of the population were not at risk, then the
age-dependent death rate for that subpopulation would be increased relative to the mean.
Any incremental increase in the age-dependent death rate creates the expectation that
the fraction of the human population at risk must decline with increasing age. Herein
we consider a hypothetical tumor suppressor gene in which inherited gene-inactivating
mutations would define the population at risk for a lethal disease such as pancreatic
cancer. How we calculate our age-dependent expectations requires some explanation.

To make our estimates we have employed Knudson's (Knudson, 1971)
multistage model for carcinogenesis as extended by Herrero-Jimenez et al. (1998). In
this model, two mutations occurring at rates r_i and r_j in somatic cells are necessary to
inactivate both copies of a first 'gatekeeper' tumor suppressor gene 'GK', resulting in
an 'initiated' cell that will grow as an adenoma. An inactivating mutation inherited in a
first gatekeeper gene 'GK' would almost certainly lead to the disease early in life. An
example of such a condition is familial adenomatous polyposis (FAP) in which
members of the same family develop colon cancer early in life because of inherited
mutations inactivating one allele of the adenomatous polyposis coli (APC) gene
(Kinzler and Vogelstein, 1996).

We have hypothesized that as the cells continue to grow and die in a growing
adenoma, the eventual stochastic loss of inherited heterozygosity in a second gatekeeper
tumor suppressor gene 'A' occurs at rate r_A. This third somatic genetic event gives rise
to a triple mutant which by rapid growth and further genetic changes creates a lethal
carcinoma (see Fig. 1). An inactivating mutation inherited in one allele of gene 'A', the
second gatekeeper, would not lead to disease until the two alleles of the first gatekeeper
are lost and the adenoma has grown to a size making loss of heterozygosity for gene 'A'
in any one adenoma cell likely. A person inheriting such a mutation in 'A' might or
might not acquire the three rate-limiting events before dying of some other cause.
Therefore a subpopulation inheriting a mutation in either allele of gene 'A' would be at
risk of a late onset cancer. In the example below, heterozygotes in the second
gatekeeper allele constitute the subpopulations at risk for 'sporadic' cancers (Fig. 1)
(Herrero-Jimenez et al., 1998).

The identification of genes in which polymorphisms (usually defined as alleles
present at an arbitrarily chosen level, typically 1% or more, of the population) represent
a genetic risk for disease has become a very important area in human genomics
(Agundez et al., 1997; Daly et al., 1997; Kaiser, 1997). Candidate genes in which
inactivating mutations could place one at an increased risk of cancer include tumor
suppressor genes, xenometabolizing genes, DNA repair and replication genes and genes
affecting the cellular kinetics of division and death in normal and adenomatous tissue.
An inactivating mutation in either the first or second gatekeeper genes would create a
primary risk factor for familial and sporadic cancers respectively. Mutations affecting
secondary factors, such as mutation rates or cell kinetics, would increase age-dependent
risk but only in persons inheriting a mutation in the first gatekeeper, 'GK', or the
second gatekeeper, 'A', genes. Germinal mutations that define sporadic late onset
mortality may be highly prevalent in a general population as they do not have a lethal
effect until rate-limiting events take place, allowing inheritance of the mutant alleles.
An example of an allele that appears to be related to late onset mortality via coronary
heart disease is the ε4 allele of the apolipoprotein E gene which has been reported to
decline in extremely aged populations (Kervinen et al., 1994; Schachter et al., 1994).

Determination of which point mutations prevalent in the general population are
associated with late onset mortality requires study of large populations in order to
achieve desirable statistical precision. Such determinations require a technology that
can detect differences as small as a single nucleotide change in large populations at
reasonable cost. Techniques that are currently used to identify point mutations
are: single-strand conformation polymorphism (SSCP) analysis (Orita
et al., 1989), restriction fragment length polymorphism (RFLP) analysis (Arnheim et al.,
1985), micro- and mini-satellite variation (Koreth et al., 1996), allele-specific
hybridization-dependent techniques (Shuber et al., 1997), denaturing gradient gel
electrophoresis (DGGE) (Guldberg et al., 1993), DNA chips (Chee et al., 1996), and
direct sequencing.

Unfortunately, the utility of these techniques is limited by the costs and labor
required to assay each individual in large populations. In some techniques such as
SSCP or DNA chips, but not DGGE, a significant fraction of possible point mutations
in a defined sequence would not be detected. In the case of direct sequencing, which
would detect any point mutation in a given sequence, cDNAs, not genomic sequences,
are generally proposed for analyses. This strategy misses point mutations blocking
normal mRNA splicing, an important point since a significant percentage (20-30%,
varying among genes) of point mutations inactivating a gene appear to be within splice
sites of the introns (Kat, 1992).

We propose a different approach based on technology developed to observe
somatically-derived point mutational spectra in human tissues. This technology has
already been demonstrated to detect mutations in 100 bp sequences with a sensitivity of
at least 10^[exponent illegible] (Khrapko et al., 1997). We first separate mutant from non-mutant
sequences using differential cooperative melting behavior of double-stranded DNAs
(Poland, 1978; Fischer and Lerman, 1983). By coupling constant denaturant capillary
electrophoresis (CDCE) with high-fidelity DNA amplification (hifi PCR) (Khrapko et
al., 1994; Khrapko et al., 1997), we could easily measure any SNP in populations of up
to 10^[exponent illegible] people in a single experiment within a defined target sequence. Through
comparisons of randomized blood samples from the general US population of newborns
with those drawn from the general US population of centenarians, point mutations
associated with mortal disease should be identified as they are expected to markedly
decrease in aged populations. When proband populations afflicted with specific
diseases are available for study, these identical polymorphisms are expected to be
markedly increased relative to the newborn population. (Because we can easily measure
point mutations well below 1%, we designate these as point mutations, rather than cavil
with established usage that 'polymorphism' implies a frequency of greater than or equal to
1%.)

The analytical approach combining CDCE and hifi PCR has the ability to
identify mutations that are associated with mortal disease with a sensitivity and
statistical strength which should make it significantly more desirable than proposed
massive resequencing strategies. Examining general populations (i.e. well-mixed
populations) should diminish biases such as founder effects which are encountered in
traditional family linkage studies. In the next section we calculate the expected
differences between newborn, proband and centenarian populations for the hypothetical
example of a gene in which inactivating mutations are a primary risk factor for
pancreatic cancer.

Analytical Methods
Fractions at risk of mortal disease: i.e. pancreatic cancer example

In order to plan a search for disease-related point mutations it is very useful to
have an estimate of the total fraction at risk in the general population. To aid us with
such estimates, we have collected the data for mortal diseases recorded in the Vital
Statistics of the United States (Census, 1900-1936; DHHS, 1937-1992). Combined
with census population data for the reporting states and counties we have been able to
calculate a birthyear cohort- and age-specific function, OBS(h,t), which is the number of
deaths from the observed cause at age t divided by the number of persons alive at age t
in a cohort born in year h.
For each birthyear h, we posit that there exists a fraction of the population, F_h, at
lifetime risk of the observed cause of death. Since our group is especially interested in
discovering gene/environment interactions defining risk, F_h is modeled as the product of
the fraction of genetically-susceptible persons multiplied by the fraction of
environmentally-exposed persons:

F_h = F_{h,genetic} × F_{h,environmental}        Equation 1

If everyone were exposed to the same environmental factor(s), then
F_{h,environmental} = 1 and F_h = F_{h,genetic}.
However, if only a part of the population were exposed, then F_{h,genetic} would be
expected to be greater than F_h. Therefore, F_h is a useful estimate of the lowest possible
value of F_{h,genetic}. We have been experimenting with an algebraic approach to
calculate F_h and other parameters in the general multistage model of Fig. 1. In our
formulation, P(t) is the probability of dying of the observed cause at age t, given that
one is both at risk and is still alive at age t. Deaths that are not caused by the risk
factors for the observed form of death are accounted for but 'cancel out' in our derivation
(Herrero-Jimenez et al., 1998). Deaths caused by the identical risk factors for the
observed form of death are accounted for by representing the fraction of the sum of the
deaths due to these factors from the observed form of death as f. We have found that the
birthyear cohort- and age-specific function OBS(h,t) may be reasonably approximated
as:

OBS(h,t) = [expression in F_h, P(t) and the integral of P(t) dt; not legible in this copy]        Equation 2

when the survival rate for the observed cancer is small.

P(t), the age-dependent probability of the observed form of death within the
group at risk, we have found to be usefully represented through algebra first suggested
by Moolgavkar et al. (1988), somewhat extended by Herrero-Jimenez et al. (1998) and
further corrected (Herrero-Jimenez et al., 1999):

P(t) = [expression not legible in this copy]        Equation 3

where in addition to the terms defined in Fig. 1 we use a to represent age at the time of
initiation, t to represent age at death from the observed cause and N_a to represent the
number of cells in a tissue at risk.

We use X(h,t) to represent the fraction of the surviving general population still at 
risk of death from pancreatic cancer as the population ages: 

Any inherited mutation that creates the risk fraction should be selected against as
each birthyear cohort ages. This means that at t = 0, X(h,t) = F_h, and that, as the
population at risk dies somewhat faster than the general population, X(h,t) will decrease
with increasing age, t.

OBS(h,t) and X(h,t) for European-American males born between 1900 and 1909
who died of pancreatic cancer are shown in Fig. 2A. The mortality rate is low until it
rises monotonically from the early 40s, reaches a maximum at 85, and decreases to a
much lower value for centenarians. The fraction at risk for the cohort, F_h, is calculated
to be 22%, which defines X(h,t) for the cohort at birth when t = 0. X(h,t) remains at
approximately 22% until age 60 then decreases to a value of about 4.4% at age 100 (see
Fig. 2B). In short, while 22% of the population had a risk of pancreatic cancer at age 0,
only some 4.4% of surviving centenarians would still be at risk. Note that only 3.3% of
the population born in 1900-09 actually died of pancreatic cancer. One of the strengths
of using OBS(h,t) to calculate F(h) is that it does not depend on the actual fraction of a
birth cohort dying of the observed disease. F(h) defines the fraction of a birthyear
cohort which would die from pancreatic cancer in an imaginary world in which
pancreatic cancer were the only form of death, i.e., F(h) is the fraction of the cohort
population that could potentially die from pancreatic cancer.
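
The decline of X(h,t) from 22% to 4.4% can be illustrated with a simple depletion sketch. It rests on an assumption that is not taken from the equations of this document: the at-risk group is thinned only by the extra deaths from the observed disease, while all other causes of death act equally on both groups and cancel. Under that assumption, with S the survival of the at-risk group relative to the rest of the cohort, X = F·S / (F·S + 1 - F). The function names are hypothetical.

```python
def fraction_at_risk(f_birth, relative_survival):
    """X at a later age, given the at-birth fraction F and the survival S of
    the at-risk group relative to the rest of the cohort (assumed model)."""
    at_risk = f_birth * relative_survival
    return at_risk / (at_risk + (1.0 - f_birth))


def implied_relative_survival(f_birth, x_later):
    """Invert the formula above: the S needed to move from F to X."""
    return x_later * (1.0 - f_birth) / (f_birth * (1.0 - x_later))


s_100 = implied_relative_survival(0.22, 0.044)
print(f"implied relative survival to age 100: {s_100:.3f}")
print(f"check, X(100) = {fraction_at_risk(0.22, s_100):.3f}")
```

Under this reading, the quoted figures would correspond to roughly one in six at-risk individuals escaping the disease to age 100, relative to those not at risk.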

2.2. Distribution of inactivating and silent mutations in genes

In the example of pancreatic cancer, F_h derived from the data of Fig. 2A and
Equation 2 is about 0.22. This suggests that at least 22% of the population carries a
pancreatic cancer disease-related gene (monogenic hypothesis). These allelic variants
could consist of frameshifts, stop codons, splice site changes, inactivating missense
mutations, large additions or deletions. Point mutations would be represented in all
these classes of mutation except for large additions and deletions. In addition to
expressed mutations one would expect to see a set of silent (cryptic) mutations, mostly
missense or "wobble" in nature when they fall in the exons but of all kinds when they
fall in introns.

The numerical distribution of allelic variants would be expected to contain a few
point mutations (e.g., polymorphisms) present at high frequencies in the population,
among many other point mutations present at lower frequencies. This quantitative
distribution of mutations over a DNA sequence must also be considered when planning
point mutation studies in populations.

To acquaint ourselves with such a distribution of known rare point mutations
inactivating a typical human gene, we examined the set of all such known mutations
inactivating the Hprt gene (in exons and neighboring splice sites) in human cells.
Mutations in this gene and DNA-mediated transformations were studied in the early
1960s by Szybalska and Szybalski (1962). About a dozen hotspots were found to
account for about 50% of all Hprt mutations (Kat, 1992; Kat et al., 1993). The fractions
of individual mutations range from about 0.1% to greater than 10% of the total observed
mutations, each of which inactivated the Hprt gene. Applying this example to the point
mutations creating risk of pancreatic cancer, we would expect that at least 22% of the
population would carry such an inactivating mutation in one allele of the gene. But we
would expect to find the mutant as a hotspot in only 11% of the population if we
inspected the sequences of exons and splice sites. Expecting these to be found in ten
different point mutations, an average point mutation would be expected in about 1.1%
of the newborn population. Accompanying these hotspots which inactivate the gene
would be multiple mutations that have no effect on gene expression (silent or cryptic
mutations).
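
The chain of estimates in the preceding paragraph (22% at risk, half of that in hotspots, spread over ten point mutations) can be written out as a short calculation. The function and parameter names are illustrative only.

```python
def average_hotspot_carrier_fraction(fraction_at_risk=0.22,
                                     hotspot_share=0.5,
                                     n_hotspots=10):
    """Return (fraction of the population carrying any hotspot mutation,
    fraction carrying one average hotspot mutation)."""
    in_hotspots = fraction_at_risk * hotspot_share   # 0.22 * 0.5
    per_hotspot = in_hotspots / n_hotspots           # spread over ten sites
    return in_hotspots, per_hotspot


in_hotspots, per_hotspot = average_hotspot_carrier_fraction()
print(f"{in_hotspots:.1%} of newborns carry a hotspot mutation")
print(f"{per_hotspot:.1%} carry any one particular hotspot mutation")
```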

Silent inherited point mutations in the general population have been observed in
the APC gene by many laboratories (Fodde et al., 1992; Nagase et al., 1992; Powell et
al., 1992; Dobbie et al., 1994; Groden et al., 1991). We have collected these reports
and, in summary, find that there are 15 silent polymorphic mutations distributed
throughout the APC gene cDNA. The frequencies of each SNP range from 1 to 40% of
all alleles sequenced.

With our suggested technological approach, using 10,000 mixed blood samples,
any rare point mutation with a frequency as low as 0.05% could be observed with
reasonable precision. Such a result would arise from 10 individuals each carrying one
mutant allele and the 95% confidence limits on such an expectation are 5-18. In a gene
for which mutations may lead to increased risk of death after age 50, i.e. cancer,
diabetes, or atherosclerosis, individual silent point mutations are expected to be
approximately at equal fractions with inactivating polymorphisms in newborns. These
fractions of silent point mutations would not be expected to change in proband or
centenarian populations.
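
The quoted limits of 5-18 around a count of 10 are consistent with an exact Poisson interval. The sketch below assumes that this is how the limits were obtained (the text does not say) and that the 0.05% figure is an allele frequency, so that 10,000 diploid individuals contribute 20,000 alleles. It uses only the standard library; the helper names are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def _bisect_decreasing(func, lo, hi, tol=1e-9):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_interval(observed, conf=0.95):
    """Exact two-sided interval for a Poisson mean, given an observed count."""
    tail = (1 - conf) / 2
    if observed == 0:
        lower = 0.0
    else:
        # Mean at which P(X >= observed) equals the tail probability.
        lower = _bisect_decreasing(
            lambda m: tail - (1 - poisson_cdf(observed - 1, m)),
            0.0, float(observed))
    # Mean at which P(X <= observed) equals the tail probability.
    upper = _bisect_decreasing(
        lambda m: poisson_cdf(observed, m) - tail,
        float(observed), observed + 10 * math.sqrt(observed + 1) + 10)
    return lower, upper


individuals, allele_frequency = 10_000, 0.0005
expected_alleles = round(individuals * 2 * allele_frequency)
low, high = exact_poisson_interval(expected_alleles)
print(expected_alleles, round(low, 1), round(high, 1))
```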

2.3. Single genetic risk factors: monogenic diseases 

Some disease risks are defined by inheritance of an inactivating mutation of
either allele of one and only one gene. We refer to these as monogenic risks and in our
formulation, F_h = F_{genetic} × F_{environmental}. An example was illustrated in Fig. 1 in which
gene 'A' heterozygotes represent the subpopulation at risk for a sporadic cancer type.
All gene-inactivating point mutations in gene 'A' at birth would contribute to the
estimate of F_h, the fraction at risk among newborns. Using the initial value afforded by
the estimate of F_h, and the values of X(h,t) from Fig. 2B, we have calculated the
expected number of rare point mutations as a function of age in randomly selected
populations of 10,000 persons.

Newborn population born in 1998: X(h,t) = X(1998, 0) = F_h = 0.22
10,000 individuals sampled
2200 individuals would be expected to carry a disease-related allele
1100 individuals would be expected to carry a detectable disease-related
polymorphism in the exons and splice sites of the gene
110 individuals would have a particular detectable inactivating polymorphism at an
average allele frequency of 5.5 x 10^-3
110 individuals would have a particular silent polymorphism at an average allele
frequency of 5.5 x 10^-3

Proband population: X = 1
10,000 individuals sampled; all would be expected to carry a disease-related allele
5000 individuals would be expected to carry a detectable disease-related
polymorphism in the exons and splice sites of the gene
500 individuals would have a particular detectable inactivating polymorphism at an
average allele frequency of 2.5 x 10^-2
110 individuals would have a particular silent polymorphism at an average allele
frequency of 5.5 x 10^-3

Centenarian population: X(h,t) = X(1895, 100+) = 0.044
10,000 individuals sampled
440 individuals would be expected to carry a disease-related allele
220 individuals would be expected to carry a detectable disease-related
polymorphism in the exons and splice sites of the gene
22 individuals would have a particular detectable inactivating polymorphism at an
average allele frequency of 1.1 x 10^-3
110 individuals would have a particular silent polymorphism at an average allele
frequency of 5.5 x 10^-3

2.4. Multiple independent genetic risk factors: multigenic disease

A single disease may also arise as a result of inherited mutations at any one of
several independent loci. Two examples of disease that may have similar clinical
presentations but are actually comprised of a mutation in any one of several genes are
autosomal dominant polycystic kidney disease (ADPKD) (Mochizuki et al., 1996) and
maturity-onset diabetes of the young (MODY) (Velho and Froguel, 1997).

Were inactivating mutations in any one of a set of five genes independently
capable of creating a risk for pancreatic cancer, then on average, the frequency of these
mutant hotspots would be expected to be decreased by a factor of five relative to the
monogenic case considered above. We represent this case for multigenic risk by
F_h = F_{h,A} ∪ F_{h,B} ∪ F_{h,C} ..., where F_{h,A} = F_{genetic 'A'} × F_{environmental '1'},
F_{h,B} = F_{genetic 'B'} × F_{environmental '2'},
etc. Inspection of the case of pancreatic cancer risk defined in this way illustrates that
now mutant fraction detection as low as 0.1% is essential for a sample size on the order
of 10,000. X(A) denotes the surviving fraction at risk for a disease contributed by gene
'A', one specific gene of the possible 5 genes, 'A', 'B', 'C', 'D', 'E', in which an
inherited inactivating allele creates a risk for pancreatic cancer. One may, as in the
example of monogenic inheritance above, calculate expected values for the newborn,
proband and centenarian populations. While we omit the arithmetic exercise here, the
results are included in Table 1.

2.5. Multiple interacting risk factors: polygenic disease

A disease may be caused by the combination of inherited mutations in two or
more genes. Such diseases appear to include schizophrenia (Portin and Alanen, 1997),
and non-insulin dependent diabetes mellitus (Galli et al., 1996; Velho and Froguel,
1997). We use the case of 2 separate genes and the data for pancreatic cancer as an
illustration. In our formulation, we represent the polygenic case by F_h = F_{h,A} ∩ F_{h,B}.
The geometrical mean of the mutant fractions from genes 'A' and 'B' jointly creating a
risk represented by F = 0.22 is just (0.22)^(1/2) or 0.47. Therefore, in our example, 47%
would be expected to carry inactivating polymorphisms in one of genes 'A' or 'B'. The
expectations for newborns, probands and centenarians can be worked out as above for a
monogenic risk factor. Again we omit the arithmetic and provide the results in Table 1.
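
The arithmetic omitted for the multigenic and polygenic cases can be sketched by reusing the steps of the monogenic example: half of the carriers have a mutation detectable in exons and splice sites, and one particular point mutation accounts for a tenth of those. The constants and function names below are illustrative, and the reading of the multigenic case as a division by the number of genes is an assumption based on the "factor of five" statement above.

```python
import math

SAMPLE = 10_000
DETECTABLE_SHARE = 0.5      # share of carriers detectable in exons/splice sites
ONE_MUTATION_SHARE = 0.1    # share of those due to one particular point mutation


def carriers_of_one_mutation(fraction_carrying, n_genes=1, sample=SAMPLE):
    """Expected individuals carrying one particular inactivating point mutation."""
    carriers = fraction_carrying * sample
    return carriers * DETECTABLE_SHARE * ONE_MUTATION_SHARE / n_genes


def allele_frequency(individuals, sample=SAMPLE):
    """Heterozygous carriers contribute one mutant allele out of two each."""
    return individuals / (2 * sample)


cases = {
    "monogenic newborn": carriers_of_one_mutation(0.22),
    "monogenic proband": carriers_of_one_mutation(1.0),
    "monogenic centenarian": carriers_of_one_mutation(0.044),
    "multigenic newborn (5 genes)": carriers_of_one_mutation(0.22, n_genes=5),
    "multigenic proband (5 genes)": carriers_of_one_mutation(1.0, n_genes=5),
    "polygenic newborn (2 genes)": carriers_of_one_mutation(math.sqrt(0.22)),
}
for name, n in cases.items():
    print(f"{name}: {n:.0f} individuals, AF = {allele_frequency(n):.1e}")
```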

2.6. Expected values

In order to plan a study of point mutations in the general population one requires
estimates of both the expected values and their dispersion for various possible biological
situations such as those considered above. We provide in Table 1 our estimates of the
number of individuals along with their 95% confidence limits (± 2 standard deviations)
in a population of 10,000 for the situations of monogenic, multigenic and polygenic risk
factors for pancreatic cancer. Table 1 also indicates the associated allele frequency for
each of the cases. These statistical limits are calculated for one point mutation comprising
some 10% of the set of expressed (important) mutations. The use of several point
mutations in each gene or the sum of all point mutations showing age-dependent
changes greatly increases the resolving power of the studies, as taught herein. We use
examples of five and two risk factor genes involved in multigenic disease and polygenic
disease models, respectively.

Table 1. Expected number of individuals and the related allele frequency (AF) for
inactivating and silent polymorphisms in the calculated situations of monogenic,
multiple independent (multigenic disease) and multiple interacting (polygenic disease)
risk factors in pancreatic cancer.

                      Monogenic     Monogenic  Multigenic    Multigenic  Polygenic     Polygenic
                      inactivating  silent     inactivating  silent      inactivating  silent
Newborn individuals   110 ± 21      110 ± 21   22 ± 9        22 ± 9      235 ± 31      235 ± 31
Proband individuals   500 ± 45      110 ± 21   100 ± 20      22 ± 9      500 ± 45      [illegible]

2.7. Effect of population sample size

There are a number of plans afloat to collect blood samples and study point
mutation numbers in the general population. Some have suggested that as few as a few
hundred donors would be a sufficient sample to define important polymorphisms. Many
have suggested 1000 as a compromise between the recognition that 'larger samples are
better' and the practical limits imposed by the costs of 'sequential sequencing' even in
the biggest labs working on sequencing the human genome. In order to apply the
thinking of this paper in this discussion about appropriate sample size, we here consider
the expected variations in experiments involving 1000 vs. 10,000 blood donors drawn
randomly from a much larger population.

Figs. 3A-3F show the expected number ± 2SD given population sizes of 1000
and 10,000 for the hypothetical situations of monogenic, multigenic (n=5) and polygenic
(n=2) risks in pancreatic cancer analyzed for populations of 10,000 in the preceding
section. It is clear by inspection that a sample of 1000 would permit discrimination
among newborn, proband and centenarian populations for a polymorphism occurring at
the expected frequency at the 95% confidence level for the case of simple monogenic
inheritance but not for multigenic or polygenic risk factors.
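The comparison of sample sizes can be sketched numerically. The sketch below assumes that the counts are Poisson distributed (so that 2SD is twice the square root of the expected count) and uses non-overlap of the two 2SD intervals as a conservative test of discrimination; both are assumptions made here for illustration, not statements of the method behind Figs. 3A-3F. The function names are hypothetical, and the counts per 10,000 (110 vs. 500 and 22 vs. 100) are illustrative values of the kind listed in Table 1.

```python
import math


def two_sd_interval(count_per_10k, sample_size):
    """Return (mean, low, high) for an expected count rescaled to sample_size.

    Assumes the count is Poisson distributed, so SD = sqrt(mean).
    """
    mean = count_per_10k * sample_size / 10_000
    half_width = 2 * math.sqrt(mean)
    return mean, max(0.0, mean - half_width), mean + half_width


def distinguishable(count_a_per_10k, count_b_per_10k, sample_size):
    """True when the two 2SD intervals do not overlap."""
    _, low_a, high_a = two_sd_interval(count_a_per_10k, sample_size)
    _, low_b, high_b = two_sd_interval(count_b_per_10k, sample_size)
    return high_a < low_b or high_b < low_a


# Illustrative counts per 10,000 individuals (newborn vs. proband).
print(distinguishable(110, 500, 1_000))    # monogenic-like case, 1000 donors
print(distinguishable(22, 100, 1_000))     # multigenic-like case, 1000 donors
print(distinguishable(22, 100, 10_000))    # multigenic-like case, 10,000 donors
```

Under these assumptions the monogenic-like pair separates with 1000 donors, while the multigenic-like pair separates only when 10,000 donors are sampled.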

These confidence limits apply to the case of one and only one SNP compared
among the three groups. In a scan of 500-1000 bp of human DNA some 5-10 separate,
inactivating point mutations would be expected, greatly increasing the ability to
recognize the involvement of the gene in a mortal disease.

A further consideration relates to the population size needed to study harmful
mutations which are also recessive deleterious mutations. In such cases, for each gene
carrying recessive deleterious mutations, the sum of their frequencies would be about
2%. About 1% would be observed in scanning exons and splice sites of a protein
encoding gene, and the frequencies of individual point mutations would be about 0.1%
or less. In such cases, studies in populations of 100,000 or more individuals would be
required to observe a statistically significant age-dependent decline of harmful alleles
which are also recessive deleterious alleles.

2.8. Can one get 10,000 centenarian samples? 

The historical profile of survival for persons of extreme old age is plotted in Fig.
4 for the US population of European American Females (EAF) and Males (EAM) from
1910-1991 (for 1910-1933 only the 20-48 US states that submitted data to the death registry
are included) (Census, 1900-1936; DHHS, 1937-1992). Both the male and female
'100+' groups are increasing rapidly and indicate that there are more than 40,000
centenarians alive today in the US.

As shown in Fig. 2B, the fraction still at risk for pancreatic cancer among
European American males after the age of 100 (X(h,t) = X(1895, 100+)) is 0.044. Since
there were about 5000 such individuals alive, 5000 x 0.044 = 220 of these centenarians
would still be at risk of death by pancreatic cancer.
2.9. Consideration of blood sample size

In order to determine allelic frequencies in a population with desirable statistical
precision, the number of cells sampled per person must be large enough to reduce
numerical variation and reasonably constant among all samples. One microliter of blood
contains 1000 - 5000 white blood cells (McDonald et al., 1970; Rifkind et al., 1976).
This is around 10-50 times more than the minimum number of 100 cells necessary to
reduce variation for an allele present in only one person of the total pooled sample to an
acceptable ±20%. A sample size comprising 1000 cells is sufficient to reduce numerical
variation to less than 6.3% (95% confidence limits).
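As a rough check on these figures, and assuming simple Poisson counting statistics for the n cells sampled (an assumption made here, not stated in the text), the relative 95% limits are about two standard deviations divided by the mean:

\[
\frac{2\sqrt{n}}{n} = \frac{2}{\sqrt{n}}, \qquad \frac{2}{\sqrt{100}} = 0.20, \qquad \frac{2}{\sqrt{1000}} \approx 0.063
\]

These values agree with the ±20% and 6.3% limits quoted for 100 and 1000 cells.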

One need not create a separate mixed sample of 10,000 or more blood samples
for each DNA sequence to be studied. We have improved the technique of using biotin
labeled probes to isolate desired restriction fragments from human genomic DNA
samples such that a 10,000 fold purification is obtained and leaves the remaining DNA
sample available for many further extractions for other sequences (Li-Sucholeiki and
Thilly, 1998, in press). We estimate that 1 milliliter samples containing at least 1
million WBCs from each of 10,000 persons could provide enough material to study a set
of some 1 x 10« sequences of 100 or more bp each. This amount approaches that needed
to study all genes in the human genome by the methods described herein.

2.10. Effect of population composition 

Many pitfalls may be expected in the interpretation of studies of point mutations
as a function of age. These are generally beyond the scope of this article but apply to all
such studies, independent of the analytical approach used. For instance, newborns in
1998 may not reasonably approximate the ethnic distribution of the newborns of 1898
from which the centenarians of 1998 are drawn. Thus if a small recent immigrant
population representing 1% of the newborns were to carry a particular point mutation
in 50% of their alleles, the newborn population in toto would appear to have a SNP
fraction of 0.5%. Assuming that the allele frequency were very low (<10-^) in the older
extant population, a significant difference between newborns and centenarians would be
seen for this single point mutation. However, and this point cannot be overemphasized,
our approach examines two or more rare point mutations within the same gene. For a
conclusion that a gene has an important role in mortality, two or more different point
mutations comprising, for instance, the set of all nonsense point mutations must
decrease between the newborn and centenarian populations.

2.11. Point mutations linked to other risk-defining polymorphisms

Cohorts with disease or lack of disease may be derived from a common or
limited number of founders in the history of the human populations. Since the causal
allele for disease or lack of disease can remain genetically linked for tens to hundreds of
generations (Myant et al., 1997), detection of an altered allele frequency within the
proband population could indicate either the presence of a causal point mutation, or the
presence of a point mutation that remains linked to an extragenic inherited mutation.

We would like to point out that the study of multiple point mutations in the same
gene permits discrimination between these two possibilities. A change in frequency of
a single point mutation among populations as a result of linkage can be differentiated
from causality by comparing the frequencies of multiple point mutations that are
expected to affect gene function among the study populations. If the frequency of one
and only one point mutation changes between any two study populations, then linkage
must be suspected as the cause. However, if multiple inactivating point mutations
throughout the gene show similar changes in frequencies between any two populations,
then the conclusion that the gene in question plays a causal role in mortality is justified.
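The decision rule in this section can be sketched as a small classifier. Everything in the sketch is a hypothetical illustration: the function names, the use of allele counts from two equally sized population samples, and the criterion for a "change" (a difference larger than 2SD of the difference under a Poisson assumption) are assumptions made here and are not specified in the text.

```python
import math


def changed(count_a, count_b):
    """True if two allele counts differ by more than 2SD of their difference.

    Assumes Poisson counts from two samples of equal size.
    """
    return abs(count_a - count_b) > 2 * math.sqrt(count_a + count_b)


def interpret_gene(counts):
    """Classify a gene from per-mutation counts in two populations.

    counts maps a mutation name to (count in population A, count in B).
    """
    directions = [
        1 if b > a else -1
        for a, b in counts.values()
        if changed(a, b)
    ]
    if not directions:
        return "no change detected"
    if len(directions) == 1:
        return "linkage suspected"
    if len(set(directions)) == 1:
        return "causal role supported"
    return "inconsistent changes"


# One mutation changes, the others do not: linkage is suspected.
print(interpret_gene({"m1": (110, 500), "m2": (100, 100), "m3": (90, 95)}))
# Two mutations change in the same direction: a causal role is supported.
print(interpret_gene({"m1": (110, 500), "m2": (22, 100)}))
```

The extra "inconsistent changes" outcome is a design choice of the sketch: the text speaks of similar changes across mutations, so changes in opposite directions are reported separately rather than counted as support for causality.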

3. Discussion 

We teach here that a technology designed for mutational
spectrometry in human tissues can be usefully employed in the task of defining rare
point mutations which create risk for mortal disease. The technical advantages offered
would make it possible to study sequences within any gene in samples drawn from
10,000 or more persons.




We also employed our recent extension of the Knudson-Moolgavkar multistage
model of carcinogenesis to the data for age-dependent, birth year cohort-dependent
mortality for pancreatic cancer. We intended in this exercise to suggest ways in which
the observation of decreased frequencies of certain point mutations in aging populations
could be used to identify genes involved in the pathology of cancer and other
age-dependent diseases.

Our suggested use of CDCE/hifi PCR in a methodical comparison of newborn,
proband and extremely aged human populations is expected to permit identification of
point mutations arising at frequencies as low as 0.005% with reasonable precision where
100,000 persons are sampled. Multiple point mutations in the same gene, several of
which must decrease in the extremely aged and increase in the proband, would support a
hypothesis that mutations in the gene itself create a subpopulation at risk. On the other
hand, a decrease in the frequency of one and only one point mutation in a gene in the
centenarian populations, while inactivating polymorphisms remain unchanged, would
suggest linkage to a nearby allele defining risk for the observed disease. An interesting
point is that a laboratory using this technology could compare newborn to centenarian
point mutation spectra without prior knowledge of the physiological function of a DNA
sequence studied. A finding of several point mutations decreasing in centenarians would
identify that sequence as coding for a gene which affects longevity.

The methodology is also applicable to nonmortal diseases by identifying
populations at risk through other phenotypic means, such as protein expression levels or
enzyme activities. It is clear that the determination of the genetic variants within human
populations which define risk of early death or which can confer sensitivity to
environmental chemicals or pharmaceuticals will play an important role in both
epidemiology and environmental health research.
EXAMPLE 2
Summary 

The relationship between molecular mechanisms of mutagenesis and the actual
processes by which most people get cancer is still poorly understood. One missing link
is a physiologically based but quantitative model uniting the processes of mutation, cell
growth and turnover. Any useful model must also account for human heterogeneity for
inherited traits and environmental experiences.

Such a coherent algebraic model for the age-specific incidence of cancer has
been developing over the past fifty years. This development has been spurred primarily
by the efforts of Nordling [1], Armitage and Doll [2,3] and Knudson and Moolgavkar
[4] whose work defined two rate-limiting stages identified with initiation and promotion
stages in experimental carcinogenesis. Unfinished in these efforts was an accounting of
population heterogeneity and a complete description of growth and genetic change
during the growth of adenomas.

In an attempt to complete a unified model we present herein the first means to
explicitly compute the essential parameters of the two-stage initiation-promotion model
using colon cancer as an example. With public records from the 1930s to the present day,
we first calculate the fraction at primary risk for each birth year cohort and note
historical changes. We then calculate the product of rates for 'n' initiation mutations, the
product of rates for 'm' promotion mutations and the average growth rate of the
intermediate adenomatous colonies from which colon carcinomas arise.

We find that the population fraction at primary risk for colon cancer was
historically invariant at about 42% for the birth year cohorts from 1860 through 1930.
This was true for each of the four cohorts we examined (European- and
African-Americans of each gender). Additionally, the data indicate an historical increase
in the initiation mutation rates for the male cohorts and the promotion mutation rates for
the female cohorts. Interestingly, the calculated rates for initiation mutations are in
accord with mutation rates derived from observations of mutations in peripheral blood
cells drawn from persons of different ages. Adenoma growth rates differed significantly
between genders but were essentially historically invariant.

The model in its present form has also allowed us to calculate the rate of LOH or
LOI in adenomas needed to result in the high LOH/LOI fractions in tumors. But it has not
allowed us to specify the number of events, 'm', required during promotion.

INTRODUCTION 

Colon cancer mortality rates are very low but rise exponentially from childhood
to about age 60. The rates rise rapidly and approximately linearly from age 60 to 85,
reach a maximum around age 90 and decrease significantly by ages 100-104. This is
true both for males and females and for persons of European (Figs. 5A and 5B) or
non-European descent (Figs. 6A and 6B; primarily African-Americans) recorded as
dying of intestinal cancer in the United States from 1930 to the present day.

Seeking a quantitative model to account for these observations we posit that
sporadic colon cancer arises in a subpopulation at "primary" risk as a result of inherited
and/or environmental risk factors. Within this subpopulation, we apply the concepts of a
three-stage carcinogenesis process: initiation, promotion and progression. Initiation is
modeled as the accumulation of 'n' genetic events in any normal cell, giving rise to the
first cell of an adenoma. In the surviving, slowly growing adenoma, we model
promotion as the acquisition of 'm' genetic events transforming any adenoma cell into a
carcinoma cell. Progression is assumed to have a duration of less than three years and is
modeled as occurring in zero years. Thus, the fact that colon cancer rates are low until
late middle age is interpreted as the time required for a slowly growing adenoma to
acquire one cell with the necessary genetic event(s) for growth as a carcinoma. The
delay to onset of disease would be the average time necessary for an adenoma to produce
a carcinoma.

The monotonic rise in cancer rates from childhood to age 80 is seen as a natural
increase in the number of initiated cells as a function of age from birth. The early
exponential rise is interpreted as a result of both the exponential increase of parenchymal
cells from infancy to puberty and the exponential increase in an adenoma's cell number
with time. The linear increase to high cancer rates is seen as a result of a linear increase
in initiated cells as a function of age after puberty and a consequence of the constant
number of stem cells in adults. These concepts, although somewhat extended by us [5],
are common to the traditions of quantitative modeling of the carcinogenesis process.

We departed from previous work in our use of the maximum in the age-specific
cancer mortality rates. This apparent maximum in the cancer mortality rate in old age
was initially recognized but dismissed as an error of diagnosis and/or reporting in the
elderly [1-4]. Cook, Doll, and Fillingham [8], however, specifically reasoned that a true
maximum in the age-dependent mortality rate would be expected if there were a distinct
subpopulation at risk. By virtue of cancer risk, such a subpopulation would have a higher
overall death rate than a subpopulation which had no risk of cancer. As a birth year
cohort aged, there would be a smaller remaining fraction at risk and thus the observed
cancer mortality rate in the surviving population could reach a maximum and decline.
They also considered it possible that the apparent maximum and subsequent decline
resulted from errors of diagnosis and reporting, but noted that "Visual inspection of the
graphs show that when curvature is present, it usually occurs throughout the whole range
of ages examined, and no relationship was found between the amount of downward
curvature and the difficulty of diagnosis..."
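The reasoning of Cook, Doll, and Fillingham can be illustrated with a toy calculation. The sketch below is not the model developed in this document: the power-law hazard, the `scale` and `power` values and the function name are hypothetical choices made only to show the effect, and the 0.42 fraction at risk is borrowed from the summary purely as an illustrative number. Deaths from other causes are assumed to act equally on both subpopulations, so they cancel out of the ratio.

```python
import math


def observed_rate(age, fraction_at_risk=0.42, scale=3.0, power=5):
    """Cancer death rate per survivor per year in a mixed population.

    Within the at-risk group the cumulative cancer hazard is assumed to be
    scale * (age / 100) ** (power + 1); its derivative is the yearly hazard.
    """
    x = age / 100.0
    hazard = (power + 1) * scale / 100.0 * x ** power
    cumulative = scale * x ** (power + 1)
    at_risk_alive = fraction_at_risk * math.exp(-cumulative)
    not_at_risk_alive = 1.0 - fraction_at_risk
    return hazard * at_risk_alive / (at_risk_alive + not_at_risk_alive)


# Midpoints of the 5-year age groups 0-4, 5-9, ..., 100+.
ages = [2.5 + 5 * i for i in range(21)]
rates = [observed_rate(a) for a in ages]
peak_age = ages[rates.index(max(rates))]
print(peak_age, rates[-1] < max(rates))
```

Although the hazard within the at-risk group rises at every age, the rate observed among all survivors passes through a maximum in old age and then declines, because the at-risk group is depleted faster than the rest of the cohort.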

We returned to this question with a much larger data set, permitting analysis of
birth year cohort specific age dependence of recorded cancer mortality for the entire
American population of European and African descent over a period of more than sixty
years [9, 10]. We have the advantage of observing mortality rates in the extremely aged
in the last few decades in which the advent of government-sponsored health care for the
elderly has reduced uncertainty as to the cause of death. These recent data confirm the
suspicion of Cook et al. that a true maximum in cancer death rates exists. We have built
on this confirmation to derive a more comprehensive model for age-specific cancer
mortality rates in human populations.

We explicitly address the fact that the mortality rates are dependent on the
effectiveness of treatment and on the accuracy of post-mortem diagnoses. We have
assessed the former by examining the historical changes in the reported survival rates for
colon cancer. With regard to underreporting, it is true that early in the twentieth century
many deaths in the most elderly were reported as deaths by "old age" or "senility".
However, the fraction of total deaths with such uninformative diagnoses has decreased
steadily in the extremely aged throughout this century [9,10]. As in the case of survival
rates, we have used historical records on the number of uninformative diagnoses to
estimate the level of underreporting for each age and birth year cohort.

We also have the advantages created by new understanding of the genetic changes
leading to cancers in humans. We are persuaded that the somatic genetic evidence
supports a model in which the loss of two active APC alleles is sufficient and necessary
for initiation of most sporadic colon cancers [7, 11]. However, our derived model is
general for "n" initiation mutations to permit facile testing of other hypotheses.
Curiously, many mutations recorded in malignant tumors, such as in the ras
proto-oncogenes or the TP53 gene, do not appear to be a part of the initiation or
promotion processes as they appear to arise in sectors, but not the totality, of tumors in
which they are measured. The absence of these mutations in every carcinoma cell
suggests that they are not among the rate-limiting steps in normal tissue cells or
adenomas which define the age-specific mortality rates modeled here. These
post-adenomatous mutations may be considered important steps in tumor progression
which, for our purposes, is considered a relatively rapid process of less than three years
duration.

Building on our quantitative model for promotion we have also addressed the
fact that early colon carcinomas and adenomas demonstrate marked loss of
heterozygosity (LOH) for informative loci on all chromosomes, on average 22% [12-17].
In a single study, loss of genomic imprinting (LOI) in colorectal carcinomas for a single
marker was found to be 44% [18]. We use our model for 'm' promotion mutations to
explore plausible mechanisms to account for these high LOH and LOI fractions.

We have placed most mathematical derivations in an appendix but have not
hesitated to place the algebraic statements necessary to understanding the logic of the
model in the text along with explanations of their physical meanings. The primary data
sets from which all of our calculations arise are available through our website
http://cehs4.mit.edu for the use of all researchers.

MATERIALS: Population Data and Known Physiological Parameters
I. Primary Population Data Sets 
A. Mortality Data 

Annual age-specific mortality data for the US population were obtained from the US
Department of Health and Human Services (Vital Statistics of the United States,
1937-1992) [9] and the U.S. Bureau of the Census (Mortality Statistics, 1900-1936)
[10]. Population values were provided by the Duke Center for Demographic Studies for
the years 1950 to 1992. For previous years, we derived population estimates directly
from census counts for those states and counties reporting to the national death
registries. Combined, these data sets permitted calculation of the age-specific mortality
rate for each birth year, "h", designated OBS(h,t) for "OBServed" colon cancer mortality
rate.



(Eq. 5)
\[
\mathrm{OBS}(h,t) = \frac{\text{recorded deaths from intestinal cancer from birth cohort } h \text{ at age } t}{\text{recorded population size from birth cohort } h \text{ at age } t}
\]

Figs. 5A, 5B, 6A and 6B summarize the age-specific colon cancer mortality
records for the birth years between 1840 and 1930 for European- and
Non-European-Americans, respectively. The age-specific mortality data were grouped in
5-year intervals 0-4, 5-9, ..., 90-94, 95-99, 100+. The mean age of these groups is
approximately 2.5, 7.5, ..., 97.5 and 102.5 and accounts for our use of these ages herein.
To simplify presentation of data, we have summed deaths and populations of individuals
born in the same decade. The 1830s denotes persons born in 1830-1839 and so forth.
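The bookkeeping described here (Eq. 5, the 5-year age groups and the pooling of birth years by decade) can be sketched as follows. The function names and the dictionary layout are hypothetical, and the counts in the example are invented solely to exercise the code; they are not taken from the mortality records.

```python
def age_group_midpoint(lower_age, width=5):
    """Mean age assigned to a 5-year age group, e.g. 0-4 -> 2.5."""
    return lower_age + width / 2


def observed_rate(deaths, population):
    """OBS(h,t): recorded deaths divided by recorded population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population


def decade_cohort_rate(deaths_by_birth_year, population_by_birth_year, decade_start):
    """Pool deaths and populations of persons born in the same decade."""
    years = range(decade_start, decade_start + 10)
    deaths = sum(deaths_by_birth_year.get(year, 0) for year in years)
    population = sum(population_by_birth_year.get(year, 0) for year in years)
    return observed_rate(deaths, population)


# Invented counts for one age group of the 1880s birth cohort.
deaths = {1880: 3, 1885: 7, 1890: 99}
population = {1880: 4000, 1885: 6000, 1890: 5000}
print(age_group_midpoint(100), decade_cohort_rate(deaths, population, 1880))
```

Summing deaths and populations before dividing, rather than averaging yearly rates, weights each birth year by its population size.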

Intestinal cancer records are available since 1930, as opposed to 1958 for colon
cancer, when specific diagnosis became available. We used these data to approximate
colon cancer deaths; deaths by cancer of the small intestine represented only 3% of the
total number of deaths from intestinal cancer in the period during which colon cancer
was specifically recorded [9].

The population numbers for individuals described as "nonwhite" would include
persons of Native American and Asian heritage as well as African heritage. However,
the U.S. population so described during the period when the term "non-white" was used
was more than 75% of African descent [9].

B. Survival Data & Estimates

For each birth year cohort h and each age t within each cohort there is an associated
relative 5-year survival rate for colon cancer S(h,t). S(h,t) represents the probability of
surviving those causes linked to colon cancer.



(Eq. 6)
\[
S(h,t) = \frac{\text{recorded colon cancer survivors at age } t+5 \text{, diagnosed at age } t}{(\text{recorded diagnoses of colon cancer at age } t) \times (\text{survival rate for all forms of death, age } t+5)}
\]
Here, we use the relative survival rate rather than the observed survival rate, recognizing
that individuals who were diagnosed with colon cancer could have died of an unrelated
cause within the 5-year period after diagnosis.
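A minimal sketch of the relative survival calculation of Eq. 6 follows. The function and parameter names are hypothetical and the numbers in the example are invented; the sketch only shows how dividing by the all-cause survival rate removes deaths from unrelated causes from the estimate.

```python
def relative_survival(survivors_at_t_plus_5, diagnoses_at_t, all_cause_survival):
    """Relative 5-year survival S(h,t) in the sense of Eq. 6.

    all_cause_survival is the 5-year survival rate for all forms of death
    in a comparable population of the same age.
    """
    if diagnoses_at_t <= 0 or not 0 < all_cause_survival <= 1:
        raise ValueError("invalid inputs")
    observed = survivors_at_t_plus_5 / diagnoses_at_t
    return observed / all_cause_survival


# Invented example: 180 of 400 patients alive after 5 years,
# while 90% of comparable persons survive all causes of death.
print(relative_survival(180, 400, 0.9))
```

In this invented example the observed survival is 0.45 and the relative survival is about 0.50.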

Naturally, improvements in colon cancer diagnosis and therapy have contributed
to a historically increasing value of S(h,t). An important complication, however, is that
at ages greater than 75 there is a diminishing probability that early diagnosis will
be accomplished and/or that rigorous forms of therapy will be applied. Unfortunately,
the data set for these values is incomplete and our effort proceeds with some
approximations.

Eisenberg et al. [19] have summarized the age-specific relative survival rates for
1935 to 1959 within the state of Connecticut for both males and females up to ages
65-74, and estimated for ages greater than 75. The NCI Monograph No. 6 [20]
summarized similar relative survival rates for 1950 to 1957 for both males and females,
including both a larger set of hospital registries, and relative survival rates for untreated
individuals. The Cancer Patient Survival Report Number 5 [21] extended this work to
the period of 1950 to 1972, including survival rates for patients of both European and
African descent. Gloeckler et al. [22] similarly report the relative survival rates for the
period of 1973 to 1975 by age, gender, and race. Last, the SEER Cancer Statistics
Reviews [23-25] have recorded the 5-year relative survival rates for 1983 through 1991.
Beart et al. [26] have reported the age-specific relative survival rates for 1983, although
gender and race were not specified. Beart's overall survival estimates for the early 1980s
were about 10% lower than those reported by SEER [25]. Consequently, for the 1980s, we
chose to use SEER's reported survival rates decreased by 5% to represent the average of
SEER and Beart et al.
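The 5% adjustment follows from simple averaging. Writing S_SEER for the survival rate reported by SEER (a symbol introduced here only for illustration) and taking Beart's estimate as about 10% lower, that is 0.90 S_SEER:

\[
\frac{S_{\mathrm{SEER}} + 0.90\,S_{\mathrm{SEER}}}{2} = 0.95\,S_{\mathrm{SEER}}
\]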

Reported survival rates did not account for those deaths of individuals first
diagnosed with cancer at the time of death. The percentage of 'incidences at autopsy' is
1-2% for the 1990s [personal communication, L.A.G. Ries, SEER]. For diagnostic years
1935-79 the percentage of 'incidences at autopsy' for the state of Connecticut was
generally 1-3%, but was shown to increase as a function of age [27].
Survival rates are approximately constant between ages 40 and 75 in recent
decades [22-25]. However, Beart [26] extended the survival data to 80 years and found
that survival rates decreased significantly at 70-79 and >80 years of age. Survival rates
appear to decrease even further in extreme old age when colon cancer is more often
detected in an advanced stage. In persons over 80 years of age, only 2.4% of all colon
tumors were treated by surgery and chemotherapy, compared to 26.3% for persons less
than 50 years old [26]. As an estimate, we use a survival rate for centenarians of 3-4%,
the survival rate for untreated tumors for 75+ year olds [20]. We interpolate between this
estimate and the other data to approximate the survival for any unspecified ages.

Figs. 7A and 7B illustrate these points for European-American females born
between the 1840s and 1930s. The data recorded by year of diagnosis in Figure 7A are
converted into age-specific relative survival values by year of birth in Figure 7B. Where
values were unknown, estimates were interpolated. Survival rates reported for the age
ranges "under 45" and "above 75" are plotted at ages 40 and 80, respectively, as these
are approximately the average ages of individuals dying of colon cancer in these
cohorts. S(h,t) increases steadily with historical time, which creates a steady age-specific
increase in S(h,t) for any particular birth year cohort, but which still decreases markedly
in extreme old age.
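The interpolation step can be sketched as below. The text does not state the interpolation scheme, so linear interpolation in age is an assumption made here, as are the function name and the placement of the centenarian estimate at age 102.5. The example points use the 1990s EAM entries of Table 2 for ages 0-44 (plotted at 40), 75+ (plotted at 80) and 100+.

```python
def interpolate_survival(age, points):
    """Linearly interpolate survival at `age` from (age, survival) pairs.

    Ages outside the covered range take the nearest known value.
    Assumes the ages in `points` are distinct.
    """
    points = sorted(points)
    if age <= points[0][0]:
        return points[0][1]
    if age >= points[-1][0]:
        return points[-1][1]
    for (age0, surv0), (age1, surv1) in zip(points, points[1:]):
        if age0 <= age <= age1:
            return surv0 + (surv1 - surv0) * (age - age0) / (age1 - age0)


# 1990s EAM: "under 45" at age 40, "above 75" at age 80, centenarians at 102.5.
known = [(40, 0.58), (80, 0.59), (102.5, 0.03)]
print(interpolate_survival(91.25, known))
```

Halfway between ages 80 and 102.5 this sketch gives a survival of about 0.31, midway between the two bracketing values.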

Table 2 summarizes the relative survival rates by year of diagnosis, using
averages for those years for which more than one value was available. Averages were
weighted according to the numbers of patients examined in each study. To estimate
relative survival rates for Non-European-Americans, we referred to the
African-American survival rate data set for the 1950s through the 1990s, as
African-Americans comprised more than 75% of the Non-European population.
Estimates for the 1940s and 1930s cohorts of Non-Europeans were interpolated
assuming that the change in the survival rate for European-Americans was proportional
to the change in the survival rate of Non-European-Americans during this period. No
estimates were allowed to drop below the reported survival rates for untreated
individuals [20].
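The averaging rule and the lower bound described above can be sketched as follows. The function name and the study values in the example are hypothetical; only the two rules themselves (weighting by number of patients, and never dropping below the untreated survival rate) come from the text.

```python
def weighted_survival(estimates, floor=0.0):
    """Patient-weighted mean of survival estimates, bounded below by `floor`.

    estimates is a list of (survival_rate, number_of_patients) pairs;
    floor is the survival rate reported for untreated individuals.
    """
    total_patients = sum(n for _, n in estimates)
    if total_patients <= 0:
        raise ValueError("at least one patient is required")
    mean = sum(rate * n for rate, n in estimates) / total_patients
    return max(mean, floor)


# Invented example: two studies of different size, untreated rate 0.03.
print(weighted_survival([(0.50, 1000), (0.40, 3000)], floor=0.03))
# An estimate below the untreated rate is raised to the floor.
print(weighted_survival([(0.02, 100)], floor=0.03))
```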
Table 2. Summary of the relative survival rates by year of diagnosis

Decade      Group    0-44         45-54        55-64        65-74        75+          100+

1990s       EAM      0.58         0.62         0.65         0.66         0.59         0.03
            EAF      [illegible]  0.65         0.62         0.64         0.60         0.04
            NEAM     0.51         0.54         0.55         0.52         0.45         0.03
            NEAF     0.55         0.53         0.56         0.53         0.46         0.04

1980s       EAM      0.49         0.59         0.59         0.60         0.57         0.03
            EAF      [illegible]  0.56         0.56         0.57         0.55         0.04
            NEAM     0.44         0.49         0.49         0.47         0.37         0.03
            NEAF     0.52         0.54         0.53         0.43         0.41         0.04

1970s       EAM      0.47         0.48         0.48         0.48         0.44         0.03
            EAF      [illegible]  [illegible]  [illegible]  0.48         0.46         0.04
            NEAM     0.42         0.46         0.45         0.38         0.32         0.03
            NEAF     0.53         0.50         0.45         0.50         0.37         0.04

1960s       EAM      0.50         0.45         0.45         0.44         0.37         0.03
            EAF      0.50         0.40         [illegible]  [illegible]  0.42         0.04
            NEAM     0.29         0.42         0.31         0.29         0.25         0.03
            NEAF     0.36         0.46         0.38         0.30         0.34         0.04

1950s       EAM      0.42         0.46         0.40         0.38         0.32         0.03
            EAF      0.46         0.46         0.46         0.42         0.38         0.04
            NEAM     0.28         0.37         0.25         0.32         0.18         0.03
            NEAF     [illegible]  0.36         0.33         0.24         0.15         0.04

1940s       EAM      0.27         0.33         0.29         0.21         0.17         0.03
            EAF      0.34         0.34         0.35         0.28         0.24         0.04
            NEAM     0.18         0.26         0.18         0.18         0.09         0.03
            NEAF     0.30         0.27         0.25         0.16         0.10         0.04

1930s       EAM      0.30         0.27         0.20         0.09         0.0[?]       0.03
            EAF      0.25         0.17         0.18         0.11         0.07         0.04
            NEAM     0.20         0.22         0.12         0.07         0.04         0.03
            NEAF     0.22         0.13         0.13         0.07         0.04         0.04

Untreated*  Males    0.00         0.10         0.11         0.07         0.03
            Females  0.11         0.07         0.07         0.07         0.04

* Reported as Other and Untreated [20]
C. Reporting Data: Estimates of Error

It is obvious that the numerator defining OBS(h,t) in Equation 5 will be affected
by the probability that an actual colon cancer mortality is recorded as such. It is equally
obvious that there are no records of inadequate diagnosis per se. In our previous attempt
[5], we faced this problem with regard to the completeness of the mortality records for
the extremely aged and found that we could improve our estimate of mortality to some
extent by noting the number of deaths in a cohort without any adequate diagnosis as a
function of age. For instance, in centenarians we noted this number was about 20% in
the 1930s for European-American males but decreased to less than 5% by the 1950s [5].
Thus we have inspected the historical record for the number of deaths with vague or
unrecorded diagnoses for all ages, genders, and ethnic groups, for each birth year cohort
analyzed. These data create a matrix for each demographic group defining an upper
estimate of the probability of accurately recording the cause of death as the function
R(h,t).

(Eq. 7)
\[
R(h,t) = \frac{\text{recorded deaths from specified causes from birth cohort } h \text{ at age } t}{\text{all recorded deaths from birth cohort } h \text{ at age } t}
\]

Fig. 8 shows the percentage of all deaths with vague diagnoses plotted as a function of
the birth year for several age groups of European-American males. The assumption here
is that the proportion of colon cancer deaths among all deaths with unrecorded diagnoses
is about the same as the proportion of colon cancers among all deaths with recorded
diagnoses.
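Under this proportionality assumption, the correction amounts to dividing the recorded colon cancer deaths by R(h,t). As a sketch, let D_cc be the recorded colon cancer deaths, D_spec the deaths with specified causes and D_all all recorded deaths for a given cohort and age (symbols introduced here only for illustration). Allocating the unspecified deaths in the same proportion gives

\[
D_{cc}^{\,\mathrm{est}} = D_{cc} + (D_{all} - D_{spec})\,\frac{D_{cc}}{D_{spec}} = D_{cc}\,\frac{D_{all}}{D_{spec}} = \frac{D_{cc}}{R(h,t)}
\]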

Application of this assumption still underestimates the true colon cancer
mortality fraction. Since about 50% of present deaths are recorded as due to
cardiovascular or cerebrovascular causes, small overestimates in these diagnoses would
lead to large underestimates of mortality from any other specific disease.
Furthermore, a diagnosis of colon cancer may be in error, particularly if a mass
in the colon were a secondary tumor from another organ. This kind of error has been
addressed in a number of studies in which pathological samples were reviewed. Colon
cancer was actually found to be somewhat over-reported in death certificates primarily
because of inclusion of a portion of rectal tumors [28].

D. Mortality data adjusted for historical and age-specific survival probability and
reporting error: definition of OBS*(h,t).

We can now use these available data to improve our estimates of actual
occurrence rates of colon cancer. We are persuaded that the amended data set is of
sufficient accuracy to permit application of the mathematical analyses we employ. Our
models are, however, explicit and allow exploration of the effect of errors in survival or
reporting data on the estimation of parameters.

Figs. 9A, 9B, 10A and 10B recast the data of Figs. 5A, 5B, 6A and 6B using all of
our estimates of S(h,t) and R(h,t) with OBS(h,t) to define a new function OBS*(h,t),

(Eq. 8)
\[
\mathrm{OBS}^{*}(h,t) = \frac{\mathrm{OBS}(h,t)}{R(h,t)\,[1 - S(h,t)]}
\]

These figures are our best estimates of what colon cancer mortality rates would 
have been in a world with accurate diagnosis and recording but no therapy of any kind. 
In a sense it is a reconstruction of "incidence" data in a world with accurate diagnosis 
but without effective therapy. 
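The adjustment of Eq. 8 can be sketched in a few lines, reading the equation as OBS(h,t) divided by the product of R(h,t) and 1 - S(h,t). The function name, the input checks and the example values are hypothetical and serve only to illustrate the direction and size of the correction.

```python
def adjusted_rate(obs, reporting, survival):
    """OBS*(h,t): observed mortality corrected for reporting and survival.

    reporting is R(h,t), the probability that a cause of death is recorded;
    survival is S(h,t), the relative 5-year survival rate.
    """
    if not 0 < reporting <= 1:
        raise ValueError("reporting must be in (0, 1]")
    if not 0 <= survival < 1:
        raise ValueError("survival must be in [0, 1)")
    return obs / (reporting * (1 - survival))


# Invented example: 80% of deaths have a specified cause and half of the
# diagnosed patients survive, so the observed rate is scaled up 2.5-fold.
print(adjusted_rate(0.001, reporting=0.8, survival=0.5))
```

Because S(h,t) falls toward a few percent in extreme old age while remaining high at younger ages, the correction raises rates at middle ages far more than at the oldest ages, which is why a plateau in OBS(h,t) can become a clear maximum in OBS*(h,t).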

An example of the importance of accounting for survival and reporting error may
be noted by comparing the function OBS(1880s,t) for the EAM cohort of Fig. 5A to the
function OBS*(1880s,t) for the same cohort in Fig. 9A. In the former, OBS(h,t) appears
to reach a stable maximum plateau by age 90, but in the latter, OBS*(h,t) shows a clear
maximum declining through age 102.5. A similar effect may be noted by comparing
within the NEAF cohort OBS(1870s,t) to OBS*(1870s,t).
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E. Knov^Ti Physiological Parameters. 
Cell kinetic rates 

One of our goals is to evaluate mutation rates per cell division from the cancer mortality 
data. For this we require cell kinetic parameters. In our previous effort [5], we employed 
an estimate of 45 minutes for the length of mitosis in vivo. This was not based on actual 
in vivo observations but relied on observations on the length of mitoses in vitro. It 
appears that this was a significant underestimate. Wright [29] reports that the duration of 
mitosis for the small intestine is 1-1.5 hours in vivo. Likewise, Weinstein [30] found 
that the mitotic time for the human jejunum is 1.4 to 2.2 hours in vivo. (No data were 
found for normal colon.) This suggests that we overestimated our kinetic rates by a 
factor of 2. The lengths of mitosis and apoptosis may additionally differ among normal
transition cells, adenoma cells and carcinoma cells. Our treatment here assumes they are in
fact the same for these three cases.

However, we also mistakenly assumed that the window of detection for mitosis
when looking at a tissue section on a slide was twice the length of mitosis [5], when in
fact it is simply the length of mitosis. These two errors fortuitously cancelled each other,
so that the values for the cell kinetic parameters in adenomas and carcinomas we have
previously reported are, accidentally, correct. Our estimate of the cell division rate in
colon adenomas, α, is 9 divisions per year and the cell division rate in early colon
carcinomas, α_C, is about 29. We find the cell death rates in colon adenomas, β, and early
carcinomas, β_C, to be approximately equal, β = β_C = 9. We did need to correct for an
error in estimating the division and death rate of normal colon epithelium, τ. Only half of
the cells would undergo division, since one half are non-dividing terminal cells. The
actual estimate for τ would be twice our previous estimate of 1.5 divisions or deaths
per year [5]. The turnover rate of normal colon epithelial cells is thus approximately 3.

Cell Number 

In order to estimate mutation rates, we also needed to know the number of colonic
epithelial cells as a function of age. We estimated that the volume of an organ increases
proportionally to the mass of an average individual. Since the colon is approximately a
cylindrical tube, we inferred that the number of colonic epithelial cells is proportional to
body mass to the two-thirds power.

Figs. 11A and 11B show the masses of average males and females respectively as
a function of age. For both males and females, body mass increases exponentially from
age 1.5 years to 14.5 years in females and 16.5 years in males. A higher constant rate is
obtained for growth between birth and age 1.5 years.

From Figs. 11A and 11B, we estimated the growth rates of males and females
from the slope of the log2 of the mass of average individuals for the age intervals 0-1.5
and 1.5-14.5 for females and 1.5 to 16.5 for males. These estimated growth rates for
mass were then multiplied by 2/3 to obtain our estimates for the growth rates of colonic
epithelial cells.

           Ages (years)   Growth rate (mass)   Growth rate (colonic cells)

Males      0-1.5          1.23                 0.82
           1.5-16.5       0.159                0.106

Females    0-1.5          1.17                 0.78
           1.5-16.5       0.167                0.111



The number of colon epithelial cells as a function of age, N_a, can therefore be written as
a piecewise function based upon the number of colonic epithelial cells in an adult,
N_max. We illustrate this with the values for males.

(Eq. 9)

\[
N_{a,\,males} =
\begin{cases}
N_{max} \ \text{(cells in adult organ)}, & a \ge 16.5 \\[4pt]
\dfrac{N_{max}}{2^{\,0.106\,(16.5 - a)}}, & 1.5 \le a < 16.5 \\[8pt]
\dfrac{N_{max}}{2^{\,0.106\,(15) + 0.82\,(1.5 - a)}}, & 0 \le a < 1.5
\end{cases}
\]



The number of colon cells in a female follows by similar reasoning.
One should also account for the fact that the weight of an average female is about 80%
that of a male at age 18 [31]. The estimate in Herrero-Jimenez et al. [5] for the number of
cells in a colon, N_max = 8.5 x 10^10 cells, made no distinction as to gender. As a
better approximation, we have used herein an estimate of 9.1 x 10^10 colonic cells in an
adult male colon and 7.9 x 10^10 in an adult female.
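
The piecewise definition of Equation 9 can be sketched as a small function. This is an illustrative sketch only: the function name is hypothetical, and it assumes base-2 exponents, as implied by growth rates taken from the slope of log2 of body mass.

```python
def colon_cells_male(age, n_max=9.1e10):
    """Eq. 9 for males: colonic epithelial cell number at a given age in years.

    n_max defaults to the adult male estimate quoted in the text.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 16.5:
        return n_max
    if age >= 1.5:
        # Slower childhood growth phase, 0.106 doublings per year
        return n_max / 2 ** (0.106 * (16.5 - age))
    # Faster infant growth phase, 0.82 doublings per year before age 1.5
    return n_max / 2 ** (0.106 * 15 + 0.82 * (1.5 - age))


newborn = colon_cells_male(0.0)
toddler = colon_cells_male(1.5)
adult = colon_cells_male(30.0)
```

The two growth phases meet at age 1.5, so the cell number changes smoothly across that boundary and reaches N_max at age 16.5.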

Somatic Mutation Rates in Humans 

We have compiled all published reports of the age-specific mutant fraction at the hprt
locus in human peripheral T cells [32-42]. Observations with absolute cloning
efficiencies less than 20% were excluded to eliminate this form of bias. Fig. 12 shows
these mutant fractions as a function of age. These data show a similar distribution
around the mean for all age groups, 0-9, 10-19, etc. up to age 75, after which the number
of persons with relatively high mutant fractions appears to decline markedly. Using all
of the data from ages 0-75, we calculate a constant rate of hprt loss of 2.1 x 10^-7
mutations per stem cell year or about 0.7 x 10^-7 mutations per stem cell division. This
estimate assumes 3 stem cell divisions per year for pluripotent cells. In human B-cell
cultures, the observed spontaneous rates of mutation at the hprt locus range from 0.5 to
2.5 x 10^-7 mutations per cell division [43-45]. These in vivo and in vitro estimates are in
reasonable agreement and represent loss of an active gene copy by point mutations and
large deletions but not by recombination.

Two estimates of LOH rates in humans differ significantly. Grist et al. [6]
reported that the sum of all pathways for loss of heterozygosity of the HLA-A locus in
peripheral T-cell precursors was about 6.6 x 10^-7 events per cell year or 2.2 x 10^-7 per
stem cell division. However, Fuller et al. [46] and Jass et al. [47] report observations
which allowed us to calculate rates of colon unicryptal LOH of about 2 x 10^-5 per stem
cell year or about 7 x 10^-6 LOH events per colonic stem cell division, 30 times higher
than LOH rates seen in blood cells. A crypt's negative phenotype for O-acetylation of
sialic acid, presumably by loss of an active allele of O-acetyl transferase, was used as
the LOH assay.

METHODS: LOGICAL AND MATHEMATICAL APPROACHES
A. The Number of Subpopulations at Risk.

The data of Figs. 5A, 5B, 6A and 6B comprise all recorded deaths from intestinal
cancers, which we use as an approximation to colon cancer. When survival and
under-reporting are accounted for, as in Figs. 9A, 9B, 10A and 10B, it is clear by inspection
that these functions reach a maximum in old age. This repeated observation is consistent
with expectation for a population in which only some fraction is at lifetime risk of colon
cancer. While recognizing that other explanations for such a maximum may be devised,
we build our analysis on the validity of the subpopulation at risk assumption and the
certain knowledge that human populations display a high degree of genetic
heterogeneity.

These data do not, however, separate deaths in families with familial
adenomatous polyposis coli (FAPC) from deaths in families with hereditary
nonpolyposis colon cancer (HNPCC or Lynch syndrome) or from deaths by "sporadic"
colon cancer. "Sporadic" cancers themselves are undifferentiated with regard to the
possibility that there are independent pathways of genetic changes leading to several
different kinds of "sporadic" colon cancer.

We posit that there could be multiple pathways to cancer in any particular organ.
The potential for and rate of transit of these pathways would be determined by unknown
but ascertainable alleles of tumor suppressor genes and genes which affect the rates of
genetic changes and cell kinetic rates in normal tissues and preneoplastic colonies
(adenomas). These alleles would be distributed throughout the entire population.
There are cancers of organs for which such a treatment assuming multiple pathways is
obviously required. In Fig. 13, we show OBS(h,t) for death by testicular cancer in which
two populations are clearly evident, one with all deaths occurring between ages 15 and
40 and a second group in which deaths begin to be observed after age 50. Mortality data
perforce lump the deaths from multiple independent pathways together.

To begin deconvolution of the existing mortality data for colon cancer, we note
that however many possible pathways to mortal colon cancer are inherent in a particular
individual, death can be caused by only one.

Thus we may stipulate that the number of colon cancer deaths must be the sum of
the deaths caused by each of the potentially multiple pathways:

(Eq. 10) OBS(h,t) = OBS1(h,t) + OBS2(h,t) + OBS3(h,t)

In the case of colon cancer, we know that mortality from FAPC and HNPCC families is
numerically small and occurs earlier in life than the "sporadic" form(s) of the disease.
For the time being we neglect their real but numerically small contribution to total colon
cancer mortality.

Some 80% of colorectal adenomas in FAPC individuals have been found to lack
an operative APC allele. It appears, therefore, that "sporadic" colon cancers have a
common initiation pathway, loss of the two inherited operative alleles of the tumor
suppressor gene APC [7].

It is also tempting to assume that the genetic change(s) needed in the promotion
of a "sporadic" colon adenoma cell to a carcinoma cell would be the same for all
individuals, but this assumption is without any evidentiary support and unnecessary for
the analyses attempted below.

Assumption of a Single Subpopulation at Risk 

These points being noted, we initially treat the colon mortality data as if there were one
and only one pathway to colon cancer. If there are multiple pathways to "sporadic"
cancer, our derivations of parameters such as the number of mutations required in
initiation and promotion, their rates and the growth rate of adenomas are, perforce, a
weighted average among the multiple pathways.

B. The Sizes of Subpopulations at Risk.

1. Primary versus Secondary Risk Factors

Here the term "primary risk" requires careful definition. We imagine that there
are persons who by virtue of their genetic inheritance and environmental experience are
at risk of death by sporadic colon cancer. It is possible that the entire population has the
same genetic risk but not the same environmental experience. Conversely, it is possible
that a common environmental experience is shared by all persons, only a fraction of
whom carry an inherited risk factor. The key postulate is that persons who do not inherit
and experience these primary risk factors cannot develop colon cancer in a full lifespan
of up to, say, 125 years. Primary genetic and environmental risk factors for sporadic
colon cancer have not yet been identified and are, therefore, hypothetical. (The primary
genetic risk factors for two forms of familial colon cancer, FAPC and Lynch syndrome
(HNPCC), are an inactive allele of the APC gene or of a mismatch repair gene,
respectively [48, 49].)

Within subpopulations at primary risk, variations in mutation rates and cell
kinetic rates are to be expected. When an inherited condition or environmental
experience lowers the expected age of death relative to all persons at primary risk, we
define it as a secondary risk factor. For instance, persons with mutation rates only
twofold higher than average would be expected to develop cancers much earlier in life
than persons with average mutation rates within the subpopulation with the same
primary risk factors [5]. Inherited or environmental factors affecting mutation rates or
adenomatous growth rate would, by this definition, be secondary risk factors.

2. Algebraic Approach to Definition of Primary Risk Fraction 

We are seeking an algebraic way to express the relationship between the
age-specific colon cancer mortality for any birth year cohort, OBS(h,t), and the primary
and secondary, inherited and environmental factors which would be expected to
influence it.

We define the population fraction at primary risk within a birth year cohort as
F(h,t), where "h" is the historical birth year and "t" is age. (The calculation of this
function and its historical changes is an important goal of our analytic effort.) The
interaction of inherited and environmental primary risk factors is represented as:

(Eq. 11) F(h,t) = F(h,t)genetic x F(h,t)environmental (Fig. 14)

We assume that there is little historical variation in the fraction of the population
inheriting primary genetic risk factors, i.e. F(h,t)genetic = G is a constant. Thus, any
real change in F(h,t) with "h" would be ascribed to historical changes in the
environmental primary risk factor, F(h,t)environmental. Environmental risk factors for
some cancers have obvious and well recorded historical variations, e.g. cigarette
smoking and lung cancer [50]. (We believe, but leave for the future formal argument,
that cigarette smoking is a primary environmental risk factor.)

Since historical changes in environmental factors would rarely reach all of the
population simultaneously, primary environmental factors can vary significantly within
the lifetimes of some birth year cohorts, e.g. the cohorts for whom manufactured
cigarettes were not available until middle age. At this stage of model development,
however, we still treat F(h,t) as invariant within a birth year cohort in the case of colon
cancer, so that:

(Eq. 12) F(h,t) = F_h = G x E_h

The idea of a fraction of the population with both inherited and environmental risk
factors for colon cancer is logically straightforward. That fraction is represented as G E_h.
But it follows that there also exist three other distinct subpopulations: those that have
neither risk factor, (1 - G) (1 - E_h), those that have the environmental but not the inherited
risk, (1 - G) E_h, and those that have the inherited but not the environmental risk, G
(1 - E_h). This point is illustrated in a Venn diagram, Fig. 14. Each of these subfractions
has potentially different age-specific death rates and this complication is now addressed.
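
The four subpopulations can be tabulated directly from G and E_h. The sketch below is illustrative only: the function name, the dictionary keys and the sample values of G and E_h are hypothetical, since the text gives no numerical values for either fraction at this point.

```python
def risk_subpopulations(g, e_h):
    """Split a birth cohort into the four fractions defined by G and E_h."""
    for name, value in (("G", g), ("E_h", e_h)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(name + " must lie in [0, 1]")
    return {
        "both": g * e_h,                          # F_h = G x E_h, at primary risk
        "environmental_only": (1.0 - g) * e_h,    # (1 - G) E_h
        "inherited_only": g * (1.0 - e_h),        # G (1 - E_h)
        "neither": (1.0 - g) * (1.0 - e_h),       # (1 - G)(1 - E_h)
    }


# Hypothetical values chosen only to show that the fractions partition the cohort
fractions = risk_subpopulations(0.5, 0.8)
```

The four fractions always sum to one, so the three subpopulations not at primary risk together make up 1 - F_h.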

3. Definition and summation of causes of mortality.

We hold it to be self-evident that the total number of deaths of persons within an
age interval in any historical year is the sum of the number of deaths from all possible
causes. Thus, we have defined the total recorded mortality rate, TOT(h,t), as the sum of
the rate of colon cancer deaths, OBS(h,t), the rate of deaths from connected causes
sharing the same primary genetic and/or environmental risk factors as colon cancer,
CON(h,t), and the rate of deaths from causes independent of the primary genetic or
environmental causes of colon cancer, IND(h,t) [5]. We have represented this concept
as:

(Eq. 13) TOT(h,t) = OBS(h,t) + CON(h,t) + IND(h,t)

The historical record defines estimates of TOT(h,t) and OBS(h,t) whereas the values of
CON(h,t) and IND(h,t) are unknown.

Related to this statement is the recognition that for each of these categories of
mortality there are related age-dependent probabilities. These are not simply equal to the
recorded mortality fractions, a misimpression unintentionally conveyed in
Herrero-Jimenez et al. [5]. Rather, these probabilities are abstract, age-dependent
functions which we here more carefully define:

P_OBS(h,t) = probability that a person born in year 'h' with both primary genetic and
environmental risk of colon cancer would die of colon cancer at age 't' given no
treatment and no competing forms of death (population at risk: G • E_h, Fig. 14).
Unreported colon cancer deaths are included in this category.

P_CON(h,t) = probability that a person born in year 'h' with either primary genetic and/or
environmental risk of colon cancer would die of any disease other than colon cancer
connected to either or both of these risks at age 't', given no treatment and no competing
independent forms of death (populations at risk: G • E_h, G • (1 - E_h) and (1 - G) • E_h, Fig. 14).

P_IND(h,t) = probability that a person born in year 'h' with neither a primary genetic nor
environmental risk of colon cancer would die of any other cause at age 't' (all
populations at risk: Fig. 14).

In particular it should be noted that OBS*(h,t) represents the observed recorded
colon cancer mortality rate for individuals born in year 'h' who were still alive at age 't'
in an abstract world without therapeutic treatment. P_OBS(h,t) is the expected colon cancer
mortality rate for an individual belonging to the group F_h who is still alive at age 't', also
given no medical intervention. OBS*(h,t) will suffer from errors in reporting and
diagnosis not accounted for by our use of R(h,t). But P_OBS(h,t), the derived mortality
probability for persons in the subpopulation G • E_h, represents all colon cancer deaths,
whether they are diagnosed and/or reported accurately or not, in an abstract world where
S(h,t) = 0.

The term for the actual probability of death from colon cancer for a person at risk
of birth year cohort h and age t would be the probability of dying of colon cancer in the
absence of treatment, P_OBS(h,t), multiplied by the probability that treatment has not been
successful, (1 - S(h,t)). Similar arguments can be introduced for the connected and
independent forms of mortality so that the probability of not surviving a 'connected'
disease would be (1 - S_CON(h,t)) and of an 'independent' form of death, (1 - S_IND(h,t)).

4. Probability of being alive at age t, P_NOT(h,t).

We assume that a person at primary risk of colon cancer can die from colon
cancer, a 'connected' disease caused by either the inherited and/or environmental risk
factors, or a cause independent of the primary risk factors for colon cancer. This permits
us to write an explicit statement for the probability, P_NOT(h,t), that a person within the
risk group F_h has not yet died from any cause at any age between birth and age 't'.

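(Eq. 14)

\[
P_{NOT}(h,t) = e^{-\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) + P_{IND}(h,t)(1 - S_{IND}(h,t)) \right] dt}
\]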

This expression is important because in considering the probability of death by
colon cancer at age 't', one required physical condition is that the individual not be
already dead. In the terminology of probability and analysis, we are writing equations
for the conditional probability of colon cancer given the fact that the individual is not
dead. In considering the diminution of the subpopulation F_h = G • E_h, we are using a
probability model of sampling without replacement.

5. Observed colon cancer mortality rate at age t, OBS(h,t)

In Herrero-Jimenez et al. [5] we derived an equation relating the observed mortality rate,
OBS(h,t), to the expected mortality rate P_OBS(h,t). Our model accounted for the fact that
there could be forms of death other than colon cancer which depend on either the
inherited or environmental risk factors for colon cancer. The sum of all deaths by these
possibilities within the group at risk for colon cancer was represented as simply
CON(h,t) in a previous stage of model development [5].

But in amending our previous model, we have had to make two changes. The first is
to include unreported colon cancer deaths in the term OBS*(h,t) rather than CON(h,t) by
virtue of the use of R(h,t). The second was to differentiate between the diseases related
to either but not both of the inherited or the environmental risk factors. To do this we
introduce the term CON1(h,t) to account for the former and CON2(h,t) for the latter. We
illustrate these risk factors using the population descriptors from Fig. 14 in Table 3.

Table 3. Forms of death for which designated subpopulations are at risk.

G • E_h      G • (1 - E_h)    (1 - G) • E_h    (1 - G) • (1 - E_h)

OBS
CON          CON1*            CON2*
IND          IND              IND              IND

* All forms of death in either CON1 or CON2 are included in the entire set of connected
forms of death, CON.

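Amended to better account for the effects of survival, S(h,t), underreporting error, R(h,t),
and the four distinct populations introduced above (Fig. 14), the complete equation for
OBS(h,t) may be written as follows:

(Eq. 15)

\[
OBS(h,t) = \frac{B_h \, (G E_h) \, (1 - S(h,t)) \, R(h,t) \, P_{OBS}(h,t) \, P_{NOT}(h,t)}{B_h \, D(h,t)}
\]

\[
\begin{aligned}
D(h,t) = {} & (G E_h) \, P_{NOT}(h,t) \\
& + (1 - G) \, E_h \, e^{-\int_0^t \left[ P_{CON2}(h,t)(1 - S_{CON2}(h,t)) + P_{IND}(h,t)(1 - S_{IND}(h,t)) \right] dt} \\
& + G \, (1 - E_h) \, e^{-\int_0^t \left[ P_{CON1}(h,t)(1 - S_{CON1}(h,t)) + P_{IND}(h,t)(1 - S_{IND}(h,t)) \right] dt} \\
& + (1 - G)(1 - E_h) \, e^{-\int_0^t P_{IND}(h,t)(1 - S_{IND}(h,t)) \, dt}
\end{aligned}
\]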

OBS(h,t) is simply the number of persons within a birth cohort 'h' who are recorded as
dying of colon cancer at age 't' divided by the number of all persons in the cohort still
alive at that age.

The numerator, the number of deaths from colon cancer at age 't', is the product
of the number of persons in the cohort at birth, B_h, the fraction at primary risk, (G x E_h),
the fraction of individuals with colon cancer that do not survive, (1 - S(h,t)), the
estimated fraction of colon cancer deaths accurately recorded, R(h,t), the fraction
expected to die of colon cancer in the absence of treatment, P_OBS(h,t), and the fraction of
(G x E_h) not already dead from any cause, P_NOT(h,t).

The denominator, the number of persons still alive at age 't', is the product of the
number of persons in the cohort at birth, B_h, and the sum of the fractions of all
subpopulations still alive, whether at risk of colon cancer or not.

Since the term for the number of persons born to a cohort, B_h, and the term
accounting for survival from causes unrelated to colon cancer risk factors,
\( e^{-\int_0^t P_{IND}(h,t)(1 - S_{IND}(h,t)) \, dt} \), are present as factors in the numerator and all terms of the
denominator, they cancel out. By next dividing the numerator and denominator by
\( e^{-\int_0^t [P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t))] \, dt} \), we convert this equation into a
more manageable form:

(Eq. 16)

\[
OBS(h,t) = \frac{(G E_h) \, (1 - S(h,t)) \, R(h,t) \, P_{OBS}(h,t)}{D'(h,t)}
\]

\[
\begin{aligned}
D'(h,t) = {} & G E_h \\
& + (1 - G) \, E_h \, e^{\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) - P_{CON2}(h,t)(1 - S_{CON2}(h,t)) \right] dt} \\
& + G \, (1 - E_h) \, e^{\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) - P_{CON1}(h,t)(1 - S_{CON1}(h,t)) \right] dt} \\
& + (1 - G)(1 - E_h) \, e^{\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) \right] dt}
\end{aligned}
\]

The algebraic elimination of the term for the probability of independent forms of
death is extremely important since there is no satisfactory way of determining its value
from public mortality records.

However, terms for deaths from "connected" diseases caused by the primary
genetic and/or environmental colon cancer risk factors are similarly undefined in the
public health record. To move beyond this clear absence of data we resort to our first,
and possibly worst, algebraic approximation.

6. Accounting for deaths by causes connected to primary risks. The function f_h.

We introduce a new term, f(h,t), the ratio of colon cancer deaths to all deaths
actually caused by either the inherited or environmental risk factors for colon cancer or
both. We clearly don't know what the "connected" diseases are and therefore have no
way of knowing what P_CON(h,t), P_CON1(h,t), P_CON2(h,t), S_CON(h,t), S_CON1(h,t) or S_CON2(h,t)
might be. Therefore we are forced to assume pro tempore that this fraction, f(h,t), is
constant for all ages within a birth year cohort but may vary among birth year cohorts.
We may imagine that CON1 and CON2 include other forms of cancer and that these
might therefore have an age dependence and survival probability similar to that of colon
cancer. But they might include some portion of cardiovascular deaths due to unknown
but shared environmental risk factors. On balance, we think this approximation is not
grossly improper: the relative age dependence of colon cancer mortality is not greatly
different from that of the major causes of human mortality, vascular disease and other major
cancers.

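Equation 17 defines the approximation using f(h,t) in the context of Equation 16:

(Eq. 17)

\[
\begin{aligned}
& \left[ (1 - G) E_h + G (1 - E_h) + (1 - G)(1 - E_h) \right] e^{\frac{1}{f(h,t)} \int_0^t P_{OBS}(h,t)(1 - S(h,t)) \, dt} \\
& \quad \approx (1 - G) \, E_h \, e^{\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) - P_{CON2}(h,t)(1 - S_{CON2}(h,t)) \right] dt} \\
& \qquad + G \, (1 - E_h) \, e^{\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) - P_{CON1}(h,t)(1 - S_{CON1}(h,t)) \right] dt} \\
& \qquad + (1 - G)(1 - E_h) \, e^{\int_0^t \left[ P_{OBS}(h,t)(1 - S(h,t)) + P_{CON}(h,t)(1 - S_{CON}(h,t)) \right] dt}
\end{aligned}
\]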



The approximation of Equation 17 distributes the effects of differential death
rates among the three populations not at risk for colon cancer. As with E_h, we recognize
that f(h,t) may vary within the lifetimes of the birth year cohorts analyzed. But for the
time being we must be satisfied with treating it as a constant weighted average for each
cohort, such that f(h,t) = f_h.

It is helpful to recall that the sum of all four subpopulation fractions is equal to
one:

(Eq. 18) [(1 - G) x E_h] + [G x (1 - E_h)] + [(1 - G) x (1 - E_h)] = 1 - (G x E_h) = (1 - F_h)



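Combining Equations 16, 17 and 18, we create the relatively simple expression:

(Eq. 19)

\[
OBS(h,t) = \frac{\mathbf{F_h} \, (1 - S(h,t)) \, R(h,t) \, \mathbf{P_{OBS}}(h,t)}{\mathbf{F_h} + (1 - \mathbf{F_h}) \, e^{\frac{1}{\mathbf{f_h}} \int_0^t \mathbf{P_{OBS}}(h,t)(1 - S(h,t)) \, dt}}
\]

for which all values are known save for F_h, f_h and P_OBS(h,t), shown in bold face.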

7. Explicit terms for primary risk factors, F_h and f_h, for a given number of initiation
mutations, 'n'.

In our previous effort F_h and f_h were estimated using a maximum likelihood
method and the general equation derived therein for OBS(h,t) [5]. This obliged us to
estimate F_h and f_h using an algebraic formula which included three additional unknown
terms for mutation and cell kinetic rates. We were unsatisfied with this computational
condition and thus sought a strategy to explicitly calculate the unknown parameters.
We thus sought to derive explicit terms for F_h and f_h for any number of initiation
mutations, 'n'. (These two terms are inherently independent of the number of promotion
mutations, 'm'.)

Our first tactic was to introduce a simpler function for P_OBS(h,t). This function
followed the original logic of Nordling [1], who noted that the age dependence of
phenomena requiring n mutations in the same cell in a cell population of constant size
would rise as a function of age to the power of (n-1):






(Eq. 20) Nordling: P_OBS(h,t) = K_h t^(n-1)

Here, K_h is a constant proportional to the product of the 'n' mutational rates and the
number of cells at risk. It is, in fact, the rate of initiation, a fact we shall use in deriving
estimates of initiation mutation rates. We modified this model by including a time
delay, Δ_h, the average latency time between initiation of a normal cell and the promotion
of any adenoma cell into a malignant form. The modified model becomes:

(Eq. 21) Modified Nordling: P_OBS(h,t) = K_h (t - Δ_h)^(n-1) (t > Δ_h)

Substituting Equation 21 into Equation 19, we notice that for a given value of 'n', there
are four unknown parameters: K_h, Δ_h, F_h, and f_h.

(Eq. 22)

\[
OBS(h,t) = \frac{F_h \, (1 - S(h,t)) \, R(h,t) \, K_h \, (t - \Delta_h)^{n-1}}{F_h + (1 - F_h) \, e^{\frac{1}{f_h} \int_0^t K_h (t - \Delta_h)^{n-1} (1 - S(h,t)) \, dt}} \qquad (t > \Delta_h)
\]

To explicitly solve for any or all of these four unknown terms, we needed to find four
independent equations. This has, in fact, now been accomplished.
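
To make the roles of the four unknowns concrete, Equation 22 can be evaluated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: S(h,t) and R(h,t) are held constant over age so that the integral has a closed form, the integral is started at the latency age because P_OBS is zero before it, and all names and sample values are hypothetical.

```python
import math


def obs_model(t, n, frac_at_risk, f_colon, k, delta, s=0.0, r=1.0):
    """Eq. 22 with S(h,t) = s and R(h,t) = r held constant over age.

    frac_at_risk is F_h, f_colon is f_h, k is K_h and delta is the latency.
    """
    if t <= delta:
        return 0.0  # no colon cancer deaths expected before the latency has elapsed
    # Closed form of the integral of K (t - delta)^(n-1) (1 - s) from delta to t
    integral = k * (1.0 - s) * (t - delta) ** n / n
    numerator = frac_at_risk * (1.0 - s) * r * k * (t - delta) ** (n - 1)
    denominator = frac_at_risk + (1.0 - frac_at_risk) * math.exp(integral / f_colon)
    return numerator / denominator


# Hypothetical parameter values, for illustration only
rate_at_70 = obs_model(70.0, 2, 0.4, 0.2, 1e-4, 20.0)
```

When the whole cohort is at risk (F_h = 1) the denominator reduces to one and the expression collapses to (1 - S) R P_OBS, which is a convenient sanity check on any implementation.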

8. Parameters determined by inspection: (F_h K_h) and Δ_h.

Δ_h and (F_h K_h) were determined for each birth year cohort by inspection from the
mortality data corrected for survival and under-reporting, OBS*(h,t), as recorded in Figs.
9A, 9B, 10A and 10B. To accomplish these estimations, we approximated OBS*(h,t)
for values up to the age t_max at which OBS*(h,t) reaches a maximum. The equation is
suitable for any number of initiation mutations, n.

(Eq. 23) OBS*(h,t) = F_h P_OBS(h,t) = F_h K_h (t - Δ_h)^(n-1) (t_max > t > Δ_h)

This formulation is essentially that first suggested by Nordling [1], who did not
have data for extreme old age and thus could not have recognized the maximum. For
example, in the case of n = 2, OBS*(h,t) would have a value of zero up to t = Δ_h and then
rise linearly with slope (F_h K_h). The x-intercept of the line is Δ_h. As OBS*(h,t)
approaches its maximum at t_max, the approximation fails. Fig. 15 shows the calculations
for European-American males born in the 1870s.
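
For the case n = 2, the reading of a slope and an x-intercept from the linear portion of the curve can be imitated with a straight-line fit. The text determines these values by inspection; the least-squares fit below is a substituted technique offered only as a sketch, and the function name and the synthetic data points are hypothetical.

```python
def fit_line(ages, rates):
    """Least-squares line through (age, OBS*) points; returns (slope, x_intercept)."""
    count = len(ages)
    mean_t = sum(ages) / count
    mean_y = sum(rates) / count
    sxx = sum((t - mean_t) ** 2 for t in ages)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ages, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # The x-intercept is the age at which the fitted line crosses zero
    return slope, -intercept / slope


# Synthetic points generated from F_h K_h = 2e-5 and a latency of 20 years
ages = [40.0, 50.0, 60.0, 70.0]
rates = [2e-5 * (t - 20.0) for t in ages]
slope, latency = fit_line(ages, rates)
```

Only ages on the rising, approximately linear part of the curve should be supplied, since the approximation of Equation 23 fails as the maximum is approached.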

Δ_h and (F_h K_h) can be similarly determined by inspection for all other values of n
by simply plotting OBS*(h,t) versus t^(n-1). For the general model with 'n' initiation
mutations, Δ_h is simply the x-intercept of the linear portion of the plot of OBS*(h,t)
versus t^(n-1). K_h and F_h, however, cannot at this stage be explicitly determined; their
product, (F_h K_h), has been explicitly defined for t_max > t > Δ_h:

(Eq. 24) (F_h K_h) = slope of linear portion of OBS*(h,t) vs. t^(n-1)

9. The use of the area under OBS^R(h,t) to define F_h in terms of f_h.

Defining the two unknowns, F_h and f_h, required two additional independent
equations. The first was supplied by the integral of the equation OBS^R(h,t) vs. t from
t = 0 to infinity, where OBS^R(h,t) represents the expected mortality rate if all deaths
from colon cancer had been reported.

(Eq. 25) OBS^R(h,t) = OBS(h,t) ÷ R(h,t)

We used the function OBS^R(h,t) as it is more easily integrated. This integral must
be approximately equal to the area observed for the data of OBS^R(h,t) vs. t as illustrated
in Fig. 16.

We intuitively expected that this area, A_h, must be a function of the population at
risk of contracting colon cancer, F_h, and the fraction of the population at risk which
would actually die from colon cancer, f_h, as opposed to forms of death sharing inherited
or environmental risk factors or both with colon cancer. This area would be independent
of factors which would affect when deaths are expected, such as survival rates, mutation
rates and cell kinetic rates, as well as the number of events required for initiation or
promotion.

The algebraic relationship among the area under OBS^R(h,t) and F_h and f_h is
shown as Equation 26. The derivation of this equation via the explicit integration of
Equation 25 is provided in the appendix. The simplicity of this result was astonishing.
More practically, it provided a definition of F_h as an explicit function of f_h and the
observed parameter A_h for any cohorts studied and, for that matter, any form of cancer
or other mortal disease.



(Eq. 26)

\[
A_h = \int_0^{\infty} OBS^R(h,t) \, dt = -f_h \cdot \ln(1 - F_h)
\]

Even with this useful observation we had not yet explicitly determined any of the
unknown terms K_h, F_h or f_h. Equations 24 and 26 alone did not independently define all
three terms. We had three unknowns and only two independent equations.
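
The area relationship of Equation 26 can be checked numerically against the model of Equation 22. The sketch below is illustrative rather than the authors' procedure: it assumes S(h,t) = 0, uses a simple trapezoidal rule over a finite age range long enough for the curve to decay, and all function names and parameter values are hypothetical.

```python
import math


def obs_r(t, n, frac_at_risk, f_colon, k, delta):
    """OBS^R(h,t): Eq. 22 divided by R(h,t), taking S(h,t) = 0 (an assumption)."""
    if t <= delta:
        return 0.0
    integral = k * (t - delta) ** n / n
    return frac_at_risk * k * (t - delta) ** (n - 1) / (
        frac_at_risk + (1.0 - frac_at_risk) * math.exp(integral / f_colon)
    )


def area_under_obs_r(n, frac_at_risk, f_colon, k, delta, t_end=400.0, step=0.01):
    """Trapezoidal estimate of the integral of OBS^R from 0 to t_end."""
    steps = int(round(t_end / step))
    total = 0.0
    previous = obs_r(0.0, n, frac_at_risk, f_colon, k, delta)
    for i in range(1, steps + 1):
        current = obs_r(i * step, n, frac_at_risk, f_colon, k, delta)
        total += 0.5 * (previous + current) * step
        previous = current
    return total


# Hypothetical parameters: n = 2, F_h = 0.4, f_h = 0.2, K_h = 1e-3, latency 20
numeric_area = area_under_obs_r(2, 0.4, 0.2, 1e-3, 20.0)
predicted_area = -0.2 * math.log(1.0 - 0.4)
```

The finite upper limit stands in for infinity; it is adequate here only because the chosen parameters make the integrand negligible well before age 400.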

10. The use of the maximum value of OBS*(h,t) to define F_h in terms of K_h and f_h.

For our last necessary equation, we took advantage of the feature that the
mortality function OBS*(h,t) reaches a clear maximum in old age as shown in Figs. 9A,
9B, 10A and 10B. The derivative of a differentiable function equals zero at an interior maximum.
By taking the derivative of OBS*(h,t) and setting it equal to zero at t = t_max, the age at
which OBS*(h,t) is a maximum, we solved for F_h in terms of the other unknowns for
any value of n. (See appendix for derivation.) This general solution is shown as Equation
27.

(Eq. 27)

\[
F_h = \frac{(n - 1) \, f_h - K_h \, (1 - S(h,t_{max})) \, (t_{max} - \Delta_h)^n}{(n - 1) \, f_h \left( 1 - e^{-\frac{1}{f_h} \int_0^{t_{max}} (1 - S(h,t)) \, K_h \, (t - \Delta_h)^{n-1} \, dt} \right) - K_h \, (1 - S(h,t_{max})) \, (t_{max} - \Delta_h)^n}
\]

Thus we had derived four independent equations using three separate features of
the mortality curves, plus the direct observation of Δh:

1. the slope of the (n-1)th root of OBS*(h,t) for Equation 20,
2. the x-intercept of the (n-1)th root of OBS*(h,t), direct estimation of Δh,
3. the area under OBSR(h,t) for Equation 22 and
4. the maximum of OBS*(h,t) for Equation 23.

Together these equations allowed us to explicitly determine two desired population risk
parameters, Fh and fh, for any birth cohort for which these four features were defined by
the data. We also solved for the physiological parameter Kh, which we use below to
estimate initiation mutation rates for any value of n. We did not use maximum likelihood
methods to determine values for these primary risk factors. Rather, we solved for them
explicitly after making use of the approximation of Equation 17 defining fh.

This tactic allowed us to estimate the historical variable Fh. That should let us
chart the health effects of environmental changes in populations. It is also the minimum
value for G or Eh for any birth year cohort since when Eh = 1, G = Fh and vice versa.
Both of these properties are of clear value in exploring the genetic and environmental
interactions which lead to colon cancer.

C. Explicit terms for secondary risk parameters: initiation and promotion mutation rates,
and adenomatous growth rates for a given n and m.

Having found an explicit way to calculate the fraction of a birth year cohort at
risk of colon cancer, we were ready to use algebraic models of tumor initiation and
promotion to calculate the values of the mutation and cell kinetics rates which are posited
to determine the age-specific mortality rates.

1. The product of initiation mutation rates, (ri rj ... rn).

Our model for initiation with n>1 required events is based on that of Armitage
and Doll [2] in which 'n' events in any cell would create the first cell of an adenoma. We
have, however, extended the physiological model to account for cell turnover in normal
tissues and the organization of tissues as turnover units of constant and equal size
containing Na total cells at age 'a'.

The Na total cells comprise terminal cells, transition cells and stem cells. It may
be that initiated terminal cells have zero probability of forming a tumor and, if this were
the case, the number of cells at risk would be reduced by 1/2. For the time being, we
model all cells as being at risk as there are several months between terminal division and
programmed death in a human colon terminal cell and a shorter period, about forty days,
between divisions in a human colon adenoma cell. The stem cell and each transition cell
undergo τ divisions and no deaths per year. The terminal cells each "die" τ times per year
and do not divide. The value of τ was determined to be approximately 3.

We have argued [5] that the most probable pathway of accumulating n > 1
mutations in tumor initiation is by the acquisition of all but one of the 'n' mutations in
the stem cell. The stem cell then repopulates its respective turnover unit with cells
carrying the (n - 1) mutations, such that the nth mutation could now occur in any of
these cells.

We represent the rates of the required initiation mutations as ri, rj, ... rn. The
expression describing the number of newly initiated cells in year 'a' is simply:

(Eq. 28)

\[ \text{Initiated cells in year } a = n\,\tau^{n}\,(r_i r_j \cdots r_n)\,N_a\,a^{n-1} \]

for the case in which the order of the n mutations is inconsequential. Expressions may 
also be derived for models in which a specific order of some or all initiation mutations 
are required. 

2. The difference in division and death rates in adenomas, (α-β), and the stochastic
extinction of newly initiated cells.

As recognized and algebraically treated by Moolgavkar [51], each initiated cell
could die before it divides. Even small colonies have a high probability that all cells will
die; only a few would be expected to survive as adenomas when the probability of cell
division is only marginally greater than the probability of cell death. Given a cell
division rate of α cell divisions per year and a death rate, β, for an initiated cell, the
probability of non-extinction or survival is (α-β)/α [51]. Thus the number of newly
arising and surviving adenomas in year 'a' would be:

(Eq. 29)

\[ \text{Surviving adenomas}(a) = \frac{\alpha - \beta}{\alpha}\,n\,\tau^{n}\,(r_i r_j \cdots r_n)\,N_a\,a^{n-1} \]

for the case in which the order of mutations is inconsequential. In the model for
promotion used below, all surviving adenomas have the property of inexorably giving
rise to a lethal carcinoma via net growth and mutation. Note that the values of α and β
have been determined to be approximately 9 divisions or 'deaths' per year, respectively.
[5] The unknown terms for the rate of initiation are the product of the rates of the n
initiation mutations (ri rj rk ... rn) and the small difference between division and death
rates in colon adenomas, (α-β).
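
The initiation and survival expressions can be evaluated directly. The sketch below is
illustrative only: it assumes Equations 28 and 29 read as n τ^n (ri rj ... rn) Na a^(n-1),
multiplied by (α-β)/α for survival of stochastic extinction. The function names and every
numerical input in the example (mutation rates, cell number, and the value β = 8.8) are
hypothetical placeholders, not values reported here.

```python
from math import prod


def initiated_cells_per_year(n, tau, rates, cells_at_risk, age):
    """n * tau**n * (product of the n rates) * N_a * a**(n - 1)."""
    if len(rates) != n:
        raise ValueError("exactly n initiation mutation rates are required")
    return n * tau ** n * prod(rates) * cells_at_risk * age ** (n - 1)


def surviving_adenomas_per_year(n, tau, rates, cells_at_risk, age, alpha, beta):
    """Initiated cells scaled by the non-extinction probability (alpha - beta) / alpha."""
    if not alpha > beta >= 0:
        raise ValueError("requires alpha > beta >= 0")
    survival = (alpha - beta) / alpha
    return survival * initiated_cells_per_year(n, tau, rates, cells_at_risk, age)


# Hypothetical inputs for n = 2 at age 50; none of these numbers come from the data.
initiated = initiated_cells_per_year(2, 3, (1e-7, 3e-7), 1e10, 50)
surviving = surviving_adenomas_per_year(2, 3, (1e-7, 3e-7), 1e10, 50, 9.0, 8.8)
```

The survival factor is small whenever α and β are close, which is why the difference
(α-β) has to be estimated separately rather than from the two rates individually.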
3. Use of the calculated value of Kh to define an independent equation with two
unknown terms: (ri rj ... rn) and (α-β).

The combination of the data of OBS*(h,t) and Equations 20, 22 and 23 allowed
explicit determination of the unknown parameter Kh for any cohort studied. Kh is
Nordling's annual rate of initiation per person, modified to include Moolgavkar's
necessary term for surviving stochastic extinction. Here we write it for the case after Na
has reached a maximum, Nmax, in young adults. [5]

(Eq. 30)

\[ K_h = \frac{\alpha - \beta}{\alpha}\,n\,\tau^{n}\,(r_i r_j \cdots r_n)\,N_{max} \]

Cell division and 'death' rates can be acquired from actual tissue samples: τ in normal
tissue, and α and β in adenomas [5]. However, the growth rate of an adenoma,
(α - β), is small and we cannot get accurate independent estimates for the division and
death rates in vivo to properly estimate this difference from tissue samples.

All terms in Equation 30 are known except (ri rj rk ... rn) and (α-β), but we can
make use of an interesting property of OBS*(h,t) to explicitly define (α-β) and then
estimate the value of (ri rj rk ... rn). For this, we must first extend Nordling's model for
the expected mortality from the OBServed disease, POBS(h,t), to account for the growth
rate of an adenoma.

4. The expected mortality rate from the OBServed disease, POBS(h,t), within the group at
risk, Fh.

In our three-stage carcinogenesis model, we assumed that the third stage,
progression, occurs rapidly and can be effectively modeled as occurring in zero years.
Therefore, the expected mortality from the OBServed disease simply equals the
probability of initiation at age 'a' (Equation 29) times the probability of promotion
occurring at a later age 't'.

Our model for the second stage of the three-stage carcinogenesis model,
promotion, is again based on that of Armitage and Doll [2] in which 'm' particular
events in any cell of the adenoma would create the first cell of a carcinoma. We first
considered the simplest case for which a single genetic event could turn an adenoma cell
into a carcinoma cell. Using the Poisson approximation, the probability of at least one
cell undergoing promotion at age 't' in an adenoma that was initiated at age 'a' is:
(Eq.31) 



a. 



(2 



(a-p)(t-a) 



- 1) cc^-Pc 



da -e 



In 2 



^(t-a) 



ex. 



) 



The exponential represents the expected number of cells in the adenoma to have
undergone promotion and survived stochastic extinction, (t - a) years after initiation.
Here, rA represents the promotion mutation rate per cell division, and (αc - βc)/αc
represents the probability of a promoted cell colony surviving stochastic extinction,
given cell division and death rates per year of αc and βc respectively. The remaining
terms in Equation 31 describe the total number of cell divisions, or chances for
promotion, that have occurred within the adenoma. We have extended our previous
approximation [5] for the total number of cell divisions to more accurately account for
cell divisions from cells that have died. (See Appendix for derivation) Combining the
probability of initiation from Equation 29 with the probability of promotion from
Equation 31, we see that the expected mortality from the observed disease within the
group at risk is (m = 1 case illustrated):



(a.p)(t-a) 



(Eq.32) ^ Al^a.pJ 
PoBS<^^'^) " ^l^fk'-^n J^^a rf(t - a) 



In 2 a 

c 



-1 d& 
Obviously, an individual could develop more than one adenoma within a lifetime, so we
necessarily accounted for all possible adenomas initiated at any age between birth and
't', by integrating across all ages. We can now use OBS*(h,t) to explicitly define (α-β)
and then estimate the value of (ri rj rk ... rn).

5. The growth rate of adenomas, (α - β).

Our incorporation of the concept of an intermediate colony, an adenoma, into Nordling's
carcinogenesis model allowed us to write the expected mortality rate, POBS(h,t), for those
individuals within the group at risk, in terms of the initiation mutation rates, the
promotion mutation rate, and the adenomatous growth rate. Incorporating our updated
expected mortality into the estimate for OBS*(h,t) in the younger population in Equation
23, and then taking the log2 of its derivative conveniently gives a line of the form:

(Eq. 33)

\[ \log_2\!\left(\frac{d\,OBS^{*}(h,t)}{dt}\right) = (\alpha - \beta)\,t + \text{constant} \]

(See Appendix for derivation) 

from which we can easily read off the slope to get an estimate for the cell kinetic growth
rate of the adenoma, (α-β). As an example, Fig. 17 shows an estimate for EAM born in
the 1920s. To evaluate the derivative of OBS*(h,t) from our mortality data, we use the
approximation Δ(OBS*(h,t))/Δt ≈ d(OBS*(h,t))/dt.
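
The slope reading can be sketched numerically. The code below is a hedged illustration,
not the procedure actually used: it assumes the line has the form log2(d OBS*/dt) =
(α-β) t + constant, replaces the derivative with finite differences between adjacent age
groups, and fits an ordinary least-squares line. The function name and the synthetic
input series are hypothetical.

```python
import math


def adenoma_growth_rate(ages, mortality):
    """Least-squares slope of log2(finite-difference derivative) against age."""
    points = list(zip(ages, mortality))
    xs, ys = [], []
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        diff = (y1 - y0) / (t1 - t0)
        if diff <= 0:
            continue  # the logarithm is undefined where the rate is not rising
        xs.append(0.5 * (t0 + t1))  # place each difference at the interval midpoint
        ys.append(math.log2(diff))
    if len(xs) < 2:
        raise ValueError("need at least two rising intervals to fit a slope")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic series that doubles 0.2 times per year over ages 20 to 60 (not real data).
synthetic_ages = list(range(20, 65, 5))
synthetic_rates = [1e-6 * 2 ** (0.2 * t) for t in synthetic_ages]
estimated_slope = adenoma_growth_rate(synthetic_ages, synthetic_rates)
```

Intervals where the observed rate does not increase are skipped, so noisy data at the
youngest ages would reduce the number of usable points rather than break the fit.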

6. Explicit determination of the product of initiation mutation rates, (ri rj rk ... rn).

The determination of the cell growth rate of an adenoma, (α - β), allowed us to
calculate the product of the initiation mutation rates, (ri rj rk ... rn), using the previously
derived value of Kh and Equation 30. From this product, the geometric mean can be
derived as (ri rj rk ... rn)^(1/n). We next make use of the approximation of the growth
rate of an adenoma to determine the promotion mutation rates.
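
This step amounts to rearranging Equation 30. As a minimal sketch, assuming Equation 30
has the form Kh = ((α-β)/α) n τ^n (ri rj ... rn) Nmax, the product and its geometric mean
could be computed as follows. The function names and all example values are hypothetical.

```python
def initiation_rate_product(K_h, n, tau, alpha, beta, N_max):
    """Solve K_h = ((alpha - beta) / alpha) * n * tau**n * product * N_max for product."""
    if not alpha > beta >= 0:
        raise ValueError("requires alpha > beta >= 0")
    survival = (alpha - beta) / alpha
    return K_h / (survival * n * tau ** n * N_max)


def geometric_mean_rate(product, n):
    """Geometric mean of the n initiation mutation rates."""
    if product <= 0 or n < 1:
        raise ValueError("product must be positive and n at least 1")
    return product ** (1.0 / n)


# Round trip with hypothetical numbers: build K_h from a known product, then recover it.
true_product = 3e-14
K_example = ((9.0 - 8.8) / 9.0) * 2 * 3 ** 2 * true_product * 1e10
recovered = initiation_rate_product(K_example, 2, 3, 9.0, 8.8, 1e10)
mean_rate = geometric_mean_rate(recovered, 2)
```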

7. The average promotion mutation rates, rA.

Evaluation of the average promotion mutation rate can be expressed in terms of
previously defined values. The value Δh represented the average time between initiation
and promotion. The cumulative probability of promotion in an adenoma is therefore
approximately one-half, Δh years after its initiation. This allowed us to relate the
observed delay to the average promotion mutation rate, rA, and the adenomatous growth
rate, (α - β). For the case of only one necessary promotion mutation, m=1, this is:



10 CEq. 34) 



log. 



-1' 



a 



, Ml)] 



(See Appendix for derivation) If more than one genetic event were needed to convert an
adenoma cell into a carcinoma cell, m > 1, we must consider additional phenomena.
After the accumulation of each new promotion mutation there would arise within the
adenoma a new colony of cells each containing that new promotion mutation. As in the
case of a newly initiated cell, the new colony must survive stochastic extinction, such
that a fully promoted cell would necessarily have undergone 'm' possible rounds of net
growth and stochastic redistribution.

Each promotion event could alter either the promotion mutation rate or growth
rate of the cells in this new colony. We, however, cannot deconvolute the mortality data
to evaluate these rates for each of the steps of promotion independently. Still, we can
relate the total delay between initiation and promotion, Δh, to the geometrical average
promotion mutation rate and the average adenomatous growth rate for those cells that
made up the direct lineage between the first adenoma cell and the first cancer cell.
The total expected delay between initiation and promotion, Δh, is simply the sum of the
delays between each promotion event:



log2 


1 + 


\ r ct f ac-Pc , 1 1 


-r 


+ (in-n'log2 


■ f . ' 1 






t 



(See Appendix for derivation)

This yields an equation for the general case of 'm' promotion mutations, assuming
that the order in which the promotion mutations occur is important. Such a mutation
would be expected to be an early step of promotion as it would increase the chance of an
adenoma cell undergoing complete promotion within an individual's lifetime.
If the order in which the 'm' mutations occurred were unimportant, the delay between
initiation and promotion would then be:



(Eq. 36) 



1 -r 



-0 



[ln(2)fj 



m 



-1 



In either case, the only parameter left unknown is the geometric average promotion
mutation rate, rA. Either of these equations is sufficient to evaluate this last unknown
parameter, thereby completing our explicit derivation of all physiological parameters in
the three-stage carcinogenesis model, ri, (α - β) and rA.

RESULTS

Union of the model with the public record of colon cancer mortality.
A. Example of n=2 and m=1.

As a first demonstration of our algebraic representation of the three-stage
carcinogenesis model, we apply the n=2, m=1 case. We chose n=2 based on the
observations that loss of function of both good copies of the APC alleles was sufficient
to initiate a colon adenoma (Fig. 18), and that in our previous effort [5] estimates for
mutation rates were similar to mutation rates of T-cell precursor cells [6]. We first
present m=1 as this is the simplest case.

Table 4 and Figs. 19A, 19B, 20A and 20B summarize the results for the n=2,
m=1 case, extending Herrero-Jimenez et al. [5] by including estimates for
Non-European-Americans. Initiation mutation rate ri values are reported assuming that ri
= 1/3 rj. We base this on the observation by Grist et al. [6] that the sum of all pathways
for loss of heterozygosity of the HLA-A locus in T-cell precursors is about 6.6 x 10""^
events per cell year, compared to a rate of 2.2 x 10'^ for inactivation by point mutation
for a single allele of the same gene.
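
For n = 2, the assumed 1:3 ratio lets the two rates be separated once their product is
known. The sketch below is illustrative: the function name is hypothetical, the ratio of
3 is the one stated above (ri = 1/3 rj), and the product used in the example is a made-up
number rather than a reported result.

```python
import math


def split_initiation_rates(product, ratio=3.0):
    """Return (r_i, r_j) such that r_i * r_j == product and r_j == ratio * r_i."""
    if product <= 0 or ratio <= 0:
        raise ValueError("product and ratio must be positive")
    r_i = math.sqrt(product / ratio)
    return r_i, ratio * r_i


# Hypothetical product of the two initiation mutation rates.
r_i, r_j = split_initiation_rates(3e-14)
```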
Table 4. Summary of cell kinetics for colon cancer. (Mitotic time was assumed to be 90
minutes.)

Birth year   Fh     fh     ri            rA             α - β
1840s        0.30   0.11   4.4 x 10-^    8.4 x 10*^
1850s        0.35   0.13   4.6 x 10-^    8.8 x 10"®
1860s        0.38   0.15   5.6 x 10-^    8.0 x 10'^
1870s        0.41   0.15   5.7 x 10-^    8.2 x 10'^
1880s        0.40   0.21   5.9 x 10-^    1.3 x 10**^    0.19
1890s        0.40   0.21   6.7 x         8.6 x 10'^     0.20
1900s        0.39   0.24   7.0 x         7.6 x 10'^     0.22
1910s        0.45          8.5 x         8.1 x 10"^     0.19
1920s        0.43          7.6 x 10'-'   8.1 x 10*^     0.21
1930s        0.42          7.4 x 10-'
0.21
0.28
1.6 X

Comprehensive analysis of the data for birth year cohorts from 1910 onwards
was not possible. Estimation of Fh was not possible for these birth year cohorts, as
reasonable knowledge of how the mortality rate decreases at extreme old age is
required. We were able to observe (α-β) and thereby calculate the promotion mutation
rate, rA, for more recent birth year cohorts. Fh and fh were approximated by noting that
the colon cancer rates for the later birth years did not show significant
changes in their slope (Figs. 9A, 9B, 10A and 10B), suggesting that the area under the
mortality curves would be constant. If future data demonstrate that the area actually
changed, this will necessarily have been due to a change in fh, but not in Fh. A
change in Fh would have affected both the slope and area of the age-specific colon
cancer mortality rate function.

Similarly, (α-β) could not be ascertained for birth years prior to 1880 since data
for the age interval 20-60 are used for this purpose and data were available to us only
from the reporting year of 1930 forward. As this value appears invariant for each
population cohort, other parameters were estimated assuming that the adenomatous
growth rate is a constant.

B. Example of n=2, but m>1.

We were attracted to the hypothesis that m=1 because the observed value for the
promotion mutation rate was similar to the rate of LOH in T-cell precursors (5, 6). We
then hypothesized that the necessary event for promotion was the loss of heterozygosity
of any of an undefined set of second gatekeeper genes, such that we could define the
fraction at risk by the fraction of the population heterozygous for at least one of the
second gatekeeper genes.

However, this hypothesis was inconsistent with the undisputed fact that colon
tumors display a very high fraction (on average 0.22) of LOH and LOI distributed over
all chromosomes [12-18]. We therefore reconsidered the promotional events in terms of
rates of LOH and LOI which would yield LOH and LOI fractions of 0.22 in colon
carcinomas.

There are about 9 x 63 = 567 lineal cell divisions between a first adenoma and
first carcinoma cell in colon cancer. The rate of LOH or LOI to achieve a fraction of
0.22 from events in adenomatous growth alone would be 0.22/567 = 3.9 x 10-4 LOH or
LOI events per adenoma cell division.

This estimate can be considered in terms of the geometric means of the 
promotional mutation rates for different values of m. Our calculations are summarized in 
Table 5. 

Table 5. Calculated geometric mean of promotion mutation rates, m = 1-5.
Data are for European-American females born in the 1860s.
(unordered = mutations can occur in any order,
ordered = mutations must occur in one particular order)

m    Unordered rA    Ordered rA
1    1.2 x 10-7      1.2 x 10-7
2    2.5 x 10-5      3.5 x 10-5
3    1.4 x 10-4      2.8 x 10-4
4    3.3 x 10-4      8.5 x 10-4
5    5.4 x 10-4      1.6 x 10-3

For the case of ordered promotional mutations, values of m = 3 and 4 yield
mutation rates bracketing the LOH/LOI rate of 3.9 x 10-4. For unordered promotional
events, the values for m = 4 and 5 bracket this value. Such calculations can be of use in
considering the number of LOH plus LOI events that might be required in tumor
promotion, but such considerations should not lose sight of the fact that there is no
present evidence that either LOH or LOI events are required in promotion.
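
The bracketing argument can be checked arithmetically. The sketch below recomputes the
LOH/LOI rate from the figures given above (9 x 63 lineal divisions and a fraction of
0.22) and searches the Table 5 values for the adjacent pair of m that encloses it. The
function and variable names are hypothetical; the rates are transcribed from Table 5.

```python
LINEAL_DIVISIONS = 9 * 63          # divisions between first adenoma and first carcinoma cell
LOH_LOI_FRACTION = 0.22            # average fraction reported for colon tumors
loh_rate = LOH_LOI_FRACTION / LINEAL_DIVISIONS

# Geometric mean promotion mutation rates by m, as listed in Table 5.
TABLE_5 = {
    "unordered": {1: 1.2e-7, 2: 2.5e-5, 3: 1.4e-4, 4: 3.3e-4, 5: 5.4e-4},
    "ordered": {1: 1.2e-7, 2: 3.5e-5, 3: 2.8e-4, 4: 8.5e-4, 5: 1.6e-3},
}


def bracketing_m(rates_by_m, target):
    """Return the adjacent pair (m, m + 1) whose rates enclose target, or None."""
    ms = sorted(rates_by_m)
    for low, high in zip(ms, ms[1:]):
        if rates_by_m[low] <= target <= rates_by_m[high]:
            return low, high
    return None


ordered_pair = bracketing_m(TABLE_5["ordered"], loh_rate)
unordered_pair = bracketing_m(TABLE_5["unordered"], loh_rate)
```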
Discussion

Major Conclusions.

Use of the data for survival probabilities as a function of age and history has
extended and confirmed our earlier interpretation based on mortality data alone: there is
a true maximum in the age-specific colon cancer mortality rate for males and females in
both European- and African-American subpopulations. This interpretation is essential
for the development of a means to calculate the fraction at risk of colon cancer for each
birth year cohort.

We have developed and used an extended Knudson-Moolgavkar model for
initiation and promotion in which 'n' rare events are required for initiation and 'm' for
promotion [5]. With it and the approximation of Equation 17 to account for competing
forms of death sharing environmental and/or genetic risk factors with colon cancer, we
have calculated birth year cohort-specific values for the fraction at primary risk, the
product of initiation mutation rates, the product of promotion mutation rates, and the
average growth rate of the adenomatous intermediate colony.

Since these parameters have been calculated for each birth year cohort, their
historical changes may be observed. These in turn may be considered in terms of
historical changes in human habits and their environment. These parameters may be
compared between the two large demographic cohorts for which data are available.
Similarly, the parameters for males and females may be compared.

Historical changes in the fraction at risk, Fh.

The fraction at primary risk of colon cancer has remained essentially constant for
the birth year cohorts of the 1860s to the 1940s (Figs. 19A and 19B). This is true for
males and females of both European and African heritage. This fraction is about 0.4. It is
possible that this fraction at risk was increasing from 0.3 for the birth cohort for the
1840s to 0.4 for the 1870s.

The constancy of this fraction during a period of marked changes in American
life in nutrition, smoking habits, level of exercise, industrialization and urbanization is
striking. These data suggest that none of these known environmental changes had any
effect on the fraction at risk due to environmental risk factors. It might be imagined that
there have been offsetting environmental changes but such arguments, absent data,
violate the "law of parsimony". It should be noted that this result does not indicate that
there are no environmental factors affecting age-specific colon cancer rates.
Subpopulations with conditions varying significantly from the population average might
have higher or lower rates depending on their circumstances.

Magnitude of the population at risk.

It may be surprising that the fraction at risk is as large as 40%, given that less
than 5% of all deaths result from colon cancer. This result, however, emphasizes the
importance and necessity of accounting for all other connected forms of death in
calculating the primary risk fraction for any mortal disease.

The estimate of 0.4 represents a minimum value for the fraction at primary
genetic risk. If all persons were at environmental risk, then the fraction at primary
genetic risk would be 0.4.

Population genetics of primary risk for colon cancer 

Case I: Genetic risk is conferred by a dominant mutation non-deleterious for
reproductive fitness.

In this case, homozygous recessives (wild type) would have zero genetic risk but
heterozygotes and homozygous dominants would be at equal risk. For the case of
monogenic risk in which heterozygous dominants and homozygous dominants have
equivalent phenotypes, we will assume the dominant and recessive alleles are in
Hardy-Weinberg equilibrium. The sum of heterozygote and homozygous dominant
fractions would thus be 0.4 = 2pq + q^2, where "p" is the allele frequency of recessive and
"q" of dominant alleles such that p + q = 1. Solving this quadratic equation for q, we find
q = 0.23 as the only physically possible solution since q ≤ 1.
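
The quadratic has a closed-form solution that is easy to verify. The sketch below is
illustrative and assumes only the Hardy-Weinberg relation stated above, 2pq + q^2 equal
to the fraction at risk with p + q = 1; the function name is hypothetical.

```python
import math


def dominant_allele_frequency(fraction_at_risk):
    """Solve 2pq + q**2 = fraction_at_risk with p + q = 1 for the root q in [0, 1]."""
    if not 0.0 <= fraction_at_risk <= 1.0:
        raise ValueError("fraction_at_risk must lie in [0, 1]")
    # 2(1 - q)q + q**2 = 1 - (1 - q)**2, so (1 - q)**2 = 1 - fraction_at_risk.
    return 1.0 - math.sqrt(1.0 - fraction_at_risk)


q = dominant_allele_frequency(0.4)
p = 1.0 - q
carrier_fraction = 2 * p * q + q * q   # heterozygotes plus homozygous dominants
```

The other root of the quadratic, 1 + sqrt(0.6), exceeds 1 and is discarded, which is the
sense in which q = 0.23 is the only physically possible solution.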
Thus, for the case of a dominant monogenic primary genetic risk factor, the sum
of inherited alleles coding for risk would be 0.23. One should also consider the possible
values of q for multigenic or polygenic genetic risks. For multigenic risk the average
value of q would be less than 0.23, and for polygenic risk average q's would be on the
order of 0.5 or higher.

These estimates are in the realm of possibility if there were no physiological
effect on reproductive fitness for homozygous or heterozygous states. The physiological
effect would be limited to a risk of death by colon cancer at advanced age. Since the
average rate of gene mutations leading to gene loss is about 3 x 10-5 per human generation
and there have been about 10,000 human generations, an accumulated mutant fraction of
about 0.3 would be expected for the sum of a set of neutral alleles for a single gene.

A hypothesis that primary genetic risk for colon cancer is defined by any of a set
of non-deleterious dominant mutations in one or several genes is thus not inconsistent
with the calculated primary risk fraction of 0.4. The physiological effect of such a
dominant mutation could affect initiation, promotion or progression, there being no way
to differentiate among these possibilities with existing data or understanding of
carcinogenesis.

Case II: Genetic risk is conferred by homozygosity for a recessive mutation
non-deleterious for reproductive fitness.

In the case where primary genetic risk for colon cancer requires inheritance of
two recessive alleles of the same gene, neither of which affects reproductive fitness, the
fraction of recessive homozygotes would be q^2 = 0.4 and q = 0.63 for a monogenic
disorder.

Since these recessive alleles in homozygous or heterozygous form have by our
definition no effect on reproductive fitness, they might have reached so high a
fraction in present day populations if the mutation rate for a single gene were about
twice the average for all gene inactivating mutations or if the risk were distributed over
several different genes. As in Case I, one could not logically deduce which stage of
carcinogenesis might be affected by the recessive homozygous state.

Case III: Genetic risk is conferred by a recessive mutation deleterious for reproductive 
fitness. 

A third possibility is that risk is conferred by a set of alleles in one or more genes
in which homozygosity for such mutations is lethal in embryos or at least prevents
reproduction. We again assume that these alleles are in Hardy-Weinberg equilibrium and
that mutations leading to gene loss average about 3 x 10-5 per generation. From these
assumptions we would expect the sum of mutant allele fractions for heterozygotes in any
one gene to range from about 0.005 to 0.03 in the population. The actual value for any
gene would depend on gene size and the presence of particularly marked mutational
hotspots. For these to sum to 0.4, a multigenic model is obviously required. Forty
separate genes each at a Hardy-Weinberg equilibrium value of about 1% would be a
hypothesis consistent with these calculations.

A model considering a polygenic combination of deleterious recessive alleles
would have to consider a very large number of genes (>1000). This consideration leads
us to conclude that a combination of LOH events during promotion involving two or
more genes carrying alleles deleterious for fitness is an unlikely scenario.
In colon cancer, it would appear that any required event involving loss of heterozygosity
would occur during promotion since the events of initiation are accounted as loss of two
wild type APC alleles, and events in progression would not be rate-limiting. One
inherited condition of heterozygosity would be sufficient to account for primary risk
since any number of required LOI events would presumably place all individuals at
equal risk.

In summary, a primary genetic risk fraction of 0.4 or higher could be conferred
by mutant alleles of one or a few genes if reproductive fitness were not affected. But 40
or so genes would seem to be required to create so high a primary genetic risk fraction if
homozygosity for the mutant alleles did prevent reproduction.
In the former case, the original alleles occurring as hotspot mutations would have arisen
and been fixed multiple times throughout human history. In the latter case, selection
against ancient deleterious mutations would leave only relatively recent mutations. In
large present day populations, such as those of Asia, Africa and Europe, these would be
expected to be distributed over very large numbers of families.

The implications of a minimum primary genetic risk fraction of 0.4 and the
consideration of the cases considered here have obvious application in designing a
means to find the gene or genes hypothetically carrying such risks.

The factor accounting for connected risks, fh.

As may be seen in Table 4, fh increases markedly in historical time for three of
the four cohorts and has a maximum value of 0.24 in the most recent
European-American male cohort for which fh may be calculated. Values for males are
generally higher than for females.

This factor accounts for underdiagnoses, underreporting, and deaths of
persons at risk of colon cancer by other diseases which share the genetic and/or
environmental risk factor(s). Underdiagnoses and underreporting should have decreased
from 1930 to 1992. The increasing value of fh is probably in part accounted for by this
trend. On the other hand, the low value of fh derived from the populations born in this
century suggests that the genetic and/or environmental risk factors for colon cancer are
responsible for a significant fraction of other deaths. Given that colon cancer accounts
for somewhat less than 5% of all deaths and that the value of fh is about 0.2, one must
consider that risks for colon cancer are associated with as much as 25% of all deaths
(0.05/0.2 = 0.25). So large a fraction could comprise all cancer deaths; alternately, the
genetic or environmental risk factors for colon cancer could contribute to a large fraction
of vascular disease.

As noted in the text, fh is an approximation forced upon us by ignorance of any
forms of death sharing risks with colon cancer. It is the shakiest part of our modeling
effort and represents an area in which more theoretical work is needed.

Mutation rates in initiation.

Our estimates of the rate of the first initiation mutation for the condition n=2
vary from 4 to 8 x 10-^ over all four gender and ethnic cohorts for the birth year cohorts
from the 1840s to 1940s. In making these estimates, we have relied on Grist et al. [6]
who estimated that the rate of loss of an active gene by primary mutation was
approximately one third the rate of allelic loss by LOH. Thus, rj = 3 ri has been used to
calculate ri after the product ri rj was calculated. These values are remarkably similar to
observed rates of spontaneous mutations for gene inactivation in human cell culture,
observed to be about 10'^ per cell division. They are almost identical to an estimate of
about 0.7 x 10*^ per stem cell division derived from the age dependent hprt mutant
fractions in human peripheral T cells assuming three stem cell divisions per year (Fig.
12) [32-42].

We are persuaded that these values are consistent with a model of loss of the first
APC allele in colonic stem cells at a rate of about 7 x 10"^ per stem cell division and a
rate of LOH for the second allele of about 2.1 x 10'' per stem or transition cell
division.

So close are these calculated values to observed human in vivo mutation and
LOH rates that we will assume n = 2 for colon cancer initiation until contradictory
evidence is discovered. The mutation rate for European-American Females is essentially
invariant at 7 x 10'" with historical time. But the data suggest a significant increase in
mutation rates in both male cohorts from a steady value of 4 x 10'' from the 1840s
through 1880s to over 6 x 10'*^ from the 1840s through the 1940s. African-American
Females appear to show a steady increase from 4 x 10'"^ in the 1840s to the 1900s when it
reaches the rate of 7 x 10'' seen in European-American Females.

Mutation rates in promotion.

We have no idea how many genetic changes are required for promotion in colon
cancer and thus must consider the value for the geometric mean of the mutation rates for
m = 1, 2, 3, ... as our estimate of r_A.

These genetic changes could be gene "activation" missense mutations, gene
inactivation events, LOH for an inherited heterozygous state, or loss of imprinting of a
gene by other mechanisms (LOI). These processes could involve point mutations,
recombination, and chromosomal and/or chromosomal segment loss. As noted above, we
believe, on the basis of population genetics, that there could be one and only one promotional
LOH event in the case of an inherited recessive allele deleterious for fitness in the
inherited homozygous state.

For the case m=1, the estimated value of r_A is about 2 x 10'^ per cell division for
females and 8 x 10^-8 in males, a value which is approximated by LOH in human T cells
in vivo or gene inactivation of a somewhat larger than average gene. It is much lower
than the colonic stem cell LOH rate of 7 x 10'^ derived from Fuller et al. [46] and Jass et
al. [47]. These ideas address the case of a monogenic condition. If any of multiple genes
were involved, then activation of any of several proto-oncogenes might also be
considered a numerically reasonable hypothesis.

For the case of m > 1, the estimate of r_A rises with m as indicated in Table 4. As
in our previous effort [5], it is clear that for m=1 no increase in promotional mutation
rates above those seen in normal human T cells need be invoked to account for the
age-specific colon cancer rates in humans. For m=2, a recombination rate somewhat
higher than 7 x 10"^ would be required.

Curiously, the historical estimate of this promotion mutation rate assuming m = 1 is
remarkably constant for both European-American and African-American males at about 8 x 10^-8.
For both European-American and African-American females it appears to have risen
significantly from the mid-nineteenth century to a constant level of about 2.5 x 10'*^ since
the 1890s.

The differences between the genders and similarities between the ethnic groups
may give us some reason to place confidence in these results. This would lead us to ask
what environmental change affecting all women began in the 1850s and was
completed by the 1890s that might conceivably have affected promotional mutation
rates. On the other hand, the differences, while apparent, may have arisen by the action
of unknown biases in reporting or diagnosis which were in some way gender specific.
Given the economic differences of the ethnic groups, however, one would have expected
such biases to affect comparison between ethnic groups. Such differences do not appear
at all.

Adenomatous growth rates.

These values are extraordinarily constant at about 0.2 for males and 0.17 for
females over the entire historical period analyzed. The gender-specific differences appear
to be real and constant over a century of birth year cohorts. There appear to be no
differences between the two ethnic groups. It would appear that the many environmental
changes during the century observed have had no effect on the colon adenomatous net
growth rate.

It is worth noting that these net adenomatous growth rates of 0.2 and 0.17
doublings per year are remarkably similar to the net growth rates of children, which are
about 0.16 (Figs. 11A, 11B). Net colon carcinoma growth rates are about 20 doublings
per year, which may be compared to the net growth rate of the human fetus of some 54
doublings per year.

These similarities give a quantitative basis for the idea that the genetic steps of
carcinogenesis recreate the conditions of fetal and postnatal growth in reverse.
"Oncogeny decapitulates ontogeny" sums up this idea. In this scenario, the mutations
permitting adenomatous growth take the cell back to the growth rates of children while
the additional change(s) creating a carcinoma cell permit the more rapid growth rate of
fetal life.

It is necessary to note that the observations of α, β and τ were made in adult
colons. It would be interesting to know if τ changes in neonatal and childhood growth.
Actually, we do not even know at this point if childhood growth involves an increase in
the number of stem cells, an increase in the size of turnover units or both.

The turnover rate in normal adult colons is about 3 divisions per year and in colonic
adenomas about 9 divisions per year. But in colon carcinomas the death rate is
approximately the same as in adenomas, about 9 per year, while the division rate rises to
29 per year. (Despite this higher division rate, one should note that the average time
between carcinoma cell divisions is about 13 days, far longer than the one-day division
times of cells in culture.)
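
The arithmetic behind these carcinoma figures can be checked with a short sketch. It assumes only the two rates quoted above (29 divisions and 9 deaths per cell per year); the function and variable names are illustrative and do not come from the source.

```python
DAYS_PER_YEAR = 365.0


def net_doublings_per_year(division_rate, death_rate):
    """Net growth rate: divisions per year minus deaths per year."""
    return division_rate - death_rate


def mean_days_between_divisions(division_rate):
    """Average spacing of divisions for a cell dividing division_rate times a year."""
    return DAYS_PER_YEAR / division_rate


# Carcinoma values quoted in the text: 29 divisions and about 9 deaths per year.
carcinoma_net = net_doublings_per_year(29.0, 9.0)
carcinoma_cycle_days = mean_days_between_divisions(29.0)

print(carcinoma_net)         # net doublings per year
print(carcinoma_cycle_days)  # days between divisions
```

The net rate reproduces the 20 doublings per year cited for carcinomas, and the spacing of divisions comes out between 12 and 13 days.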

High LOH and LOI levels in human colon carcinomas.

The fraction of humans showing LOH or LOI for a particular reporter locus is
generally about 0.22. Since this high fraction would not be produced in adenomatous
growth at the LOH rate observed in human T cells of about 2 x 10^-7 mutations per cell
division, we considered what rate of LOH/LOI would be necessary to achieve such a
fraction. This rate was calculated to be about 4 x 10^-4 per cell division.
This represents an estimated 2000-fold greater rate than observed in normal human
lymphoid cells in vivo and in vitro and about 60-fold higher than estimated for colon
stem cell LOH rates. So high a rate would accommodate a value of m = 4 LOH/LOI
events required for promotion (Table 5).

These calculations yield the LOH or LOI rates per cell division if all of the LOH
and LOI observed in carcinomas occurred during the growth of adenomas leading to a
single initial carcinoma cell.

Even if this high LOH and LOI rate occurred in colon adenomas, it would not
necessitate the conclusion that the number of promotional events was 3 or 4 or that
LOH or LOI was involved in promotion. Even if the LOH/LOI rates increased to
4 x 10^-4 in adenomas, the necessary promotion event might still be a single point mutation
occurring at a rate of 2 x 10^-7 per cell division.

A comment on a common error in this matter of high LOH/LOI levels in tumors
is in order. Some cancer researchers have used only the number of net doublings in
adenomatous growth to account for an LOH/LOI fraction of 0.22. This would be the
log2(adenoma cell number at the end of promotion), which is about 17. But one
requires the total number of linear divisions between the first adenoma and first
carcinoma cell for this calculation. This number is about 567.
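
The size of this error can be illustrated with a minimal sketch. It rests on a simplifying assumption that is not stated in the source: that a lineage acquires LOH/LOI independently at a constant rate r per division, so that the affected fraction after d divisions is 1 - (1 - r)^d. The function name is hypothetical.

```python
def per_division_rate(fraction, divisions):
    """Rate r per division such that 1 - (1 - r)**divisions equals fraction.

    Assumes independent events at a constant rate along one cell lineage
    (a simplification introduced for this sketch).
    """
    return 1.0 - (1.0 - fraction) ** (1.0 / divisions)


# Using the 567 linear divisions between first adenoma and first carcinoma cell.
rate_linear = per_division_rate(0.22, 567)

# Using only the 17 net doublings, as in the error described in the text.
rate_doublings = per_division_rate(0.22, 17)

print(rate_linear)
print(rate_doublings)
print(rate_doublings / rate_linear)
```

Under this assumption the 567-division count gives a rate of a few parts in ten thousand per division, while counting only the 17 net doublings inflates the inferred rate by more than an order of magnitude.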
Robustness of the model

In order to calculate the parameters of our three-stage carcinogenesis model for
several birth year cohorts, maximum likelihood techniques were avoided by making
several approximations. We recognize that these approximations might have led to an
inadequate determination of the actual values.

Survival data were obtained by observations from a small, possibly
unrepresentative population [19-26]. Table 6 shows the effects of a 10% error in the
estimate of the relative survival on the calculated parameters for one cohort,
European-American females born in the 1880s. It is clear by inspection that the
population risk parameters, F_h and f_h, as well as the initiation mutation rate, r_i, would not
be seriously affected by this range of errors. However, the terms for adenomatous
growth rate and mutation during promotion are more sensitive to this type of error.
Additionally, the reported survival data excluded diagnoses of colon cancer first detected
at autopsy. This primarily occurs in the elderly, and may lead to a larger error in the
estimated survival in the elderly. Table 6 shows the effects of a 10% error in the estimate
of survival for individuals older than 75. Again by inspection, the population risk
parameters F_h and f_h would not be seriously affected by such an error.

Table 6. Percentage Change in Parameter Estimates Given Errors in Data Sets
(Data for European-American females born in the 1880s were used.)



+10% error in S(h,t)
-10% error in S(h,t)
-10% error in S(h,t>75)
-10% error in slope
+10% error in
+10% error in Δ_h
-10% error in
+5 years in t_max
-5 years in t_max
+10% error in α-β
-10% error in α-β
We also tested the robustness of our approach by determining how much an error
in any one of the several observations derived from the raw mortality data by inspection
would affect estimates of the parameters (Table 6). At this stage of model development
and application, these tests alert us to the uncertainty of our estimates of the promotional
mutation rate, r_A, while indicating a general robustness with regard to estimates of all
other derived parameters.

Generality of the model for other cancers

We have begun analysis of the data for other cancers and find the primary data
sets to be as well-behaved as those for colon cancer (http://cehs4.mit.edu). The
application of the general model has led to preliminary estimates of adenomatous growth
rates and mutation rates similar to those observed in colon cancer. However, the
estimates of the fraction at risk are not historically constant for cancers of the lung,
stomach and brain, nor for leukemia and lymphoma. The fraction at risk, F_h, for stomach
cancer has shown a monotonic decline for the historical period studied while all others
have shown an increase relative to the birth year cohorts of the early to mid nineteenth
century.

The form of the function OBS(h,t) is remarkably similar among cancers, but
some cancers such as testicular cancer (Fig. 13), Hodgkin's disease, bone cancer and
breast cancer all appear to consist of two clearly separate groups at risk, requiring
independent modeling for each group as in Equation 13.

Further Development of the General Model

The general model represented by Equations 22 and 34 lends itself to further
useful manipulation and application. These could include modeling of diseases such as
HNPCC which are caused by inheritance of heterozygosity for a recessive but powerful
mutator allele for any of a set of genes involved in mismatch repair.

Epidemiologists express themselves in terms of "relative risk". It should be
straightforward to combine this concept with the general model developed here. A ratio
of OBS*(h,t) functions would represent a "relative risk" comparison between
populations differing in mutation rates, adenomatous growth rates, etc.

Similarly, one could now use this approach to set up a quantitative model for
probabilities of recurrent cancer in persons surviving a first cancer, assuming therapeutic
measures had no effect on that probability.
Appendix (Example 2)

$$\int_0^\infty \frac{OBS(h,t)}{R(h,t)}\,dt = \int_0^\infty \frac{F_h\,(1 - S(h,t))\,P_{OBS}(h,t)\,e^{-\frac{1}{f_h}\int_0^t (1 - S(h,t))\,P_{OBS}(h,t)\,dt}}{F_h\,e^{-\frac{1}{f_h}\int_0^t (1 - S(h,t))\,P_{OBS}(h,t)\,dt} + (1 - F_h)}\,dt$$

In order to permit integration, we introduce the variable v such that:

$$v = e^{-\frac{1}{f_h}\int_0^t (1 - S(h,t))\,P_{OBS}(h,t)\,dt}$$

$$\frac{dv}{dt} = -\frac{1}{f_h}\,(1 - S(h,t))\,P_{OBS}(h,t)\,e^{-\frac{1}{f_h}\int_0^t (1 - S(h,t))\,P_{OBS}(h,t)\,dt}$$

This allows us to create a simpler expression and solve for the definite integral:

$$\int_0^\infty \frac{OBS(h,t)}{R(h,t)}\,dt = \int_1^0 \frac{-f_h\,F_h}{F_h\,v + (1 - F_h)}\,dv$$

$$= f_h \ln\bigl(F_h\,v + (1 - F_h)\bigr)\Big|_0^1$$

$$= f_h \ln\bigl(F_h + (1 - F_h)\bigr) - f_h \ln(1 - F_h)$$

$$= -f_h \ln(1 - F_h)$$

This expression is Equation 26 in context. Note that by first dividing R(h,t) from OBS(h,t), we
avoid needing to characterize R(h,t), which is itself not explicitly integrable.
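
The closed form -f_h ln(1 - F_h) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original derivation: the combined term (1 - S(h,t)) P_OBS(h,t) is replaced by an arbitrary hypothetical hazard 0.01 t, and the values F_h = 0.4 and f_h = 0.5 are chosen only for the test. All function and variable names are hypothetical.

```python
import math


def area_under_curve(F, f, hazard, t_end, steps=200_000):
    """Trapezoidal integral of F*h(t)*v / (F*v + 1 - F) from 0 to t_end,

    where v = exp(-(1/f) * integral of h from 0 to t) and h stands in for
    (1 - S(h,t)) * P_OBS(h,t).
    """
    dt = t_end / steps
    cumulative = 0.0            # running integral of the hazard
    area = 0.0
    prev_h = hazard(0.0)
    prev_y = F * prev_h         # at t = 0, v = 1 and the denominator is 1
    for k in range(1, steps + 1):
        t = k * dt
        h = hazard(t)
        cumulative += 0.5 * (prev_h + h) * dt
        v = math.exp(-cumulative / f)
        y = F * h * v / (F * v + (1.0 - F))
        area += 0.5 * (prev_y + y) * dt
        prev_h, prev_y = h, y
    return area


F_h, f_h = 0.4, 0.5                       # illustrative values only
numeric = area_under_curve(F_h, f_h, lambda t: 0.01 * t, t_end=100.0)
closed_form = -f_h * math.log(1.0 - F_h)

print(numeric, closed_form)
```

The result does not depend on the shape of the hazard, provided its integral grows without bound so that v falls to zero at large t. That is why the area fixes a relation between F_h and f_h alone.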

The use of the maximum value of OBS*(h,t) to define F_h in terms of [...] and f_h.

Our third equation in our explicit derivation of the fraction at risk, F_h, came from the
observation that the mortality curves adjusted for survival and underreporting reached a
maximum. Using Equations 19 and 21, we note that:

$$OBS^*(h,t) = \frac{F_h\,K_h\,(t - \Delta_h)^{n-1}}{F_h + (1 - F_h)\,e^{I(t)/f_h}}, \qquad I(t) = \int_0^t (1 - S(h,t))\,K_h\,(t - \Delta_h)^{n-1}\,dt$$

Evaluating the derivative at age t = t_max, where the derivative of OBS*(h,t) equals 0, we
observe:

$$\frac{d\,OBS^*(h,t)}{dt}\bigg|_{t_{max}} = \frac{(n-1)\,F_h\,K_h\,(t_{max} - \Delta_h)^{n-2}\,\bigl(F_h + (1 - F_h)\,e^{I(t_{max})/f_h}\bigr) - F_h\,K_h\,(t_{max} - \Delta_h)^{n-1}\,(1 - F_h)\,e^{I(t_{max})/f_h}\,\frac{1}{f_h}\,\frac{dI}{dt}\Big|_{t_{max}}}{\bigl(F_h + (1 - F_h)\,e^{I(t_{max})/f_h}\bigr)^2} = 0$$

Eliminating common terms this simplifies to:

$$(n-1)\,f_h\,\bigl(F_h\,e^{-I(t_{max})/f_h} + (1 - F_h)\bigr) - (t_{max} - \Delta_h)\,\frac{dI}{dt}\Big|_{t_{max}}\,(1 - F_h) = 0$$

Evaluating the derivative of I(t), we note that:

$$(n-1)\,f_h\,\bigl(F_h\,e^{-I(t_{max})/f_h} + (1 - F_h)\bigr) - (1 - S(h,t_{max}))\,K_h\,(t_{max} - \Delta_h)^{n}\,(1 - F_h) = 0$$

Solving for the fraction at risk, F_h, creates Equation 27:

$$F_h\,\Bigl[(n-1)\,f_h\,\bigl(1 - e^{-I(t_{max})/f_h}\bigr) - (1 - S(h,t_{max}))\,K_h\,(t_{max} - \Delta_h)^{n}\Bigr] = (n-1)\,f_h - (1 - S(h,t_{max}))\,K_h\,(t_{max} - \Delta_h)^{n}$$

$$F_h = \frac{(n-1)\,f_h - (1 - S(h,t_{max}))\,K_h\,(t_{max} - \Delta_h)^{n}}{(n-1)\,f_h\,\Bigl(1 - e^{-\frac{1}{f_h}\int_0^{t_{max}} (1 - S(h,t))\,K_h\,(t - \Delta_h)^{n-1}\,dt}\Bigr) - (1 - S(h,t_{max}))\,K_h\,(t_{max} - \Delta_h)^{n}}$$
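
A small numerical check of Equation 27 is sketched below. Several things in it are assumptions made only for illustration and are not taken from the source: survival S is held constant (here 0), the hazard is taken as zero before the delay Δ_h so that the integral has the closed form (1 - S) K_h (t - Δ_h)^n / n, and the parameter values are arbitrary. The sketch computes F_h from Equation 27 and confirms that OBS*(h,t) is then stationary at t_max.

```python
import math


def hazard_integral(t, n, K, delta, survival):
    """I(t), assuming constant survival and no hazard before the delay."""
    return (1.0 - survival) * K * (t - delta) ** n / n


def fraction_at_risk(n, f, K, delta, t_max, survival):
    """F_h from Equation 27 as reconstructed in this section."""
    x = (1.0 - survival) * K * (t_max - delta) ** n
    integral = hazard_integral(t_max, n, K, delta, survival)
    numerator = (n - 1) * f - x
    denominator = (n - 1) * f * (1.0 - math.exp(-integral / f)) - x
    return numerator / denominator


def obs_star(t, F, n, f, K, delta, survival):
    """Adjusted mortality curve OBS*(h,t) for t later than the delay."""
    integral = hazard_integral(t, n, K, delta, survival)
    return F * K * (t - delta) ** (n - 1) / (F + (1.0 - F) * math.exp(integral / f))


# Illustrative parameter values (hypothetical, chosen only for this check).
n, f, K, delta, t_max, survival = 2, 0.5, 4e-4, 30.0, 80.0, 0.0

F = fraction_at_risk(n, f, K, delta, t_max, survival)

# Central difference estimate of the slope of OBS* at t_max.
step = 1e-4
slope_at_max = (obs_star(t_max + step, F, n, f, K, delta, survival)
                - obs_star(t_max - step, F, n, f, K, delta, survival)) / (2 * step)

print(F, slope_at_max)
```

With these values F_h falls between 0 and 1 and the slope at t_max vanishes to within rounding, as the stationarity condition requires.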
The expected mortality rate from the OBServed disease, P_OBS(h,t), within the group at
risk, F_h

Calculation of the number of total cell divisions in an adenoma, (t - a) years after its initiation

To properly evaluate the promotion mutation rate in terms of mutations per cell division,
we need to know the total number of cell divisions that an adenoma would have undergone
between the age at initiation, a, and the age at promotion, t.

First we consider the number of cells in the adenoma. The initial number of cells in a
surviving adenoma is not one, but α/(α - β). (This is because of the stochastic redistribution of
surviving cells among all initiated adenomas into the surviving adenomas [.]) The colony also
grows at a rate of (α - β) per year, such that the number of cells in the adenoma after (t - a)
years is:

$$N_{adenoma} = \frac{\alpha}{\alpha - \beta}\,2^{(\alpha - \beta)(t - a)}$$

Since the division and death rates of an adenoma are approximately equal [5], we can
divide a year into α periods. We can then write an equation for the number of cells in the
adenoma as a function of the number of these periods that have elapsed, δ:

$$N_{adenoma} = \frac{\alpha}{\alpha - \beta}\,2^{\frac{\alpha - \beta}{\alpha}\,\delta}$$

The number of total cell divisions having occurred in the adenoma can then be related to
the number of cells. In order to have a colony of a certain size, N_adenoma, we recognize that
there must have been (N_adenoma / 2) divisions within the last period of possible division:

$$\frac{N_{adenoma}}{2} = \frac{\alpha}{\alpha - \beta}\,\frac{2^{\frac{\alpha - \beta}{\alpha}\,\delta}}{2}$$

Consequently, the total number of cell divisions is the sum of the number of divisions needed to
give each of the intermediate sizes of the adenoma up to the last period, δ:

$$\sum_{i=1}^{\delta} \frac{\alpha}{\alpha - \beta}\,\frac{2^{\frac{\alpha - \beta}{\alpha}\,i}}{2} = \sum_{i=1}^{\alpha(t-a)} \frac{\alpha}{\alpha - \beta}\,\frac{2^{\frac{\alpha - \beta}{\alpha}\,i}}{2}$$

This summation can be solved explicitly as:

$$\frac{\alpha}{\alpha - \beta}\,\frac{2^{-\beta/\alpha}\,\bigl(2^{(\alpha - \beta)(t - a)} - 1\bigr)}{2^{(\alpha - \beta)/\alpha} - 1}$$

We actually find that taking the integral instead of the summation above gives us a reliable and
easier to remember estimate:

$$\left(\frac{\alpha}{\alpha - \beta}\right)^{2}\,\frac{2^{(\alpha - \beta)(t - a)} - 1}{2 \ln 2}$$

Last, we recognize that for every division, each of the two daughter cells can acquire the
promotion mutation. Therefore, the number of opportunities for promotion after (t - a) years is
just twice the total number of divisions:

$$\left(\frac{\alpha}{\alpha - \beta}\right)^{2}\,\frac{2^{(\alpha - \beta)(t - a)} - 1}{\ln 2}$$

as included in Equation 21.
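
The counting argument of this section can be compared numerically in a short sketch. It assumes the period-by-period reading of the argument: a founding colony of α/(α - β) cells, α periods per year, and half the current cell number dividing in each period. The value α = 9 is the adenoma division rate quoted in the text, while β = 8.8 (giving α - β = 0.2) and the 60-year interval are illustrative choices. Function names are hypothetical.

```python
import math


def divisions_by_summation(alpha, beta, years):
    """Sum N/2 divisions over each of the alpha*years periods."""
    growth = alpha - beta
    periods = round(alpha * years)
    founders = alpha / growth
    return sum(founders * 2.0 ** (growth / alpha * i) / 2.0
               for i in range(1, periods + 1))


def divisions_closed_form(alpha, beta, years):
    """Closed form of the geometric sum above."""
    growth = alpha - beta
    return ((alpha / growth) * 2.0 ** (-beta / alpha)
            * (2.0 ** (growth * years) - 1.0)
            / (2.0 ** (growth / alpha) - 1.0))


def divisions_by_integral(alpha, beta, years):
    """Integral estimate that replaces the sum."""
    growth = alpha - beta
    return ((alpha / growth) ** 2
            * (2.0 ** (growth * years) - 1.0) / (2.0 * math.log(2.0)))


alpha, beta, years = 9.0, 8.8, 60.0   # beta and years are illustrative

total_sum = divisions_by_summation(alpha, beta, years)
total_closed = divisions_closed_form(alpha, beta, years)
total_integral = divisions_by_integral(alpha, beta, years)
opportunities = 2.0 * total_integral   # two daughter cells per division

print(total_sum, total_closed, total_integral, opportunities)
```

For these values the integral estimate is within about one percent of the exact sum, which supports using the simpler expression.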



The growth rate of adenomas, (α - β).

In order to estimate the initiation and promotion mutation rates (Equations 26, 30), we
need to know the average growth rate of an adenoma. While the division and death rates, α and β
respectively, can be determined in vivo, the difference (α - β) could not, as it is too small to
estimate with any useful precision. The adenomatous growth rate can, however, be estimated
directly from the mortality curves. We will illustrate this using the case (n=2, m=1). The
calculated adenomatous growth rate is approximately the same for all other cases.

For ages below 54 years, we can estimate the adjusted observed mortality rate OBS*(h,t)
as:



OBS*(h,t) = OBS(h,t) -rTRCh.t) (1 - S(h,t))] - Fj, PqbsC'^^O 
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(a-E3)(t-a) 



1 ° -P 



c c 



PQ3s(h,t) = 2t rjr^ JaNg-^ 



In 2 a 



« 0 

As a first approximation, we assumed a constant number of cells in the target tissue. For clarity,
we group those parameters that do not vary with age (e.g. F_h is included in C_1):

(a-p)(t-a) 
t - • (2 - 1) 

03S (h,t) = q Jail^— 

0 

To help us solve this integral, we recognize that e^x ≈ 1 + x when x is small. This yields:

t (oc-P)(t-a) t (a-p)(t-a) 

d(l - 0 - C, • (2 -_])) • „ c,CU fa ^ , ^'^ d& 

OSS (h,t) = q la rf(t - a) - ^> 

0 ' 
= C ;arJ^-^^^"^\-P) ln2].a - ,2 ^""^^^ - 1 (.-P)ln2 - .1] 

" 0 ~ * 

where we combined the two constants, C_1 C_2 = C. If we now take the derivative of OBS*(h,t), we
observe:

$$\frac{d\bigl(OBS^*(h,t)\bigr)}{dt} = C\,\bigl[2^{(\alpha - \beta)t} - 1\bigr] \approx C\,2^{(\alpha - \beta)t}$$

The log2 of d(OBS*(h,t))/dt is a function of t whose slope is the adenomatous growth rate,
α - β:

$$\log_2 \frac{d\bigl(OBS^*(h,t)\bigr)}{dt} = (\alpha - \beta)\,t + \log_2(C)$$

This approximation is valid only when 2^{(α - β)t} >> 1, so when estimating the adenomatous
growth rate one needs to be careful not to use data from the age groups below age 14
[...] adenomatous growth rate accounting for cell number
increases during childhood. When we calculate the derivative of OBS*(h,t) using the data for t >
17½, we see that:

^(OBS^h,t)) ^ . (a-f3)t 2^^'^^"^"^^ , ^ ^ , (a-P)t 

7t .16.5(a-P) r.^ ^ 

The log2 of this function plotted vs. t is also a straight line as shown in Figure 18. The slope of
this function is (α - β).
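
The slope estimate described in this section can be sketched with synthetic data. Nothing below is real mortality data: the curve is generated from the approximate form whose derivative is C (2^{(α - β)t} - 1), with a hypothetical constant C and a true growth rate of 0.2, and the fit uses only ages where 2^{(α - β)t} is much greater than 1. The derivative is taken by central differences and the slope by ordinary least squares. All names are illustrative.

```python
import math

TRUE_GROWTH = 0.2   # alpha - beta used to generate the synthetic curve
C = 1e-9            # hypothetical constant grouping age-independent terms


def synthetic_obs_star(t):
    """Curve whose derivative is C * (2**(g*t) - 1); synthetic, not data."""
    g = TRUE_GROWTH
    return C * (2.0 ** (g * t) / (g * math.log(2.0)) - t)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least squares line through (xs, ys)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Ages 30 to 50 in 2.5 year steps, where 2**(g*t) is much greater than 1.
ages = [30.0 + 2.5 * k for k in range(9)]
half_width = 1.25
derivatives = [
    (synthetic_obs_star(a + half_width) - synthetic_obs_star(a - half_width))
    / (2.0 * half_width)
    for a in ages
]

estimated_growth = least_squares_slope(ages, [math.log2(d) for d in derivatives])
print(estimated_growth)
```

The fitted slope recovers the generating growth rate to within a few percent. The small remaining bias comes from the "- 1" term that the approximation drops, and it grows if younger ages are included in the fit.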



The average promotion mutation rate, r_A (m=1)



In calculating the promotion mutation rate, r_A, for the case where only one mutation was
required for the promotion of an adenoma cell, we used the approximation that the cumulative
probability would be one-half after Δ_h years, where Δ_h represents the average time between
initiation and promotion. Using Equation 21, this means:



1 = 1 

2 ^ 



log. 



a 



a-pj 



In 2 ^0 ^ ^ 



a 



[a-pj 



1 ^c-Pc 



[In (2)]^ ^c- 



However, the assumption that the cumulative probability is approximately one-half at the
average time between initiation and promotion is true only if the distribution for the probability
of promotion can be approximated by a normal distribution. As (α - β) increases, the actual
distribution can no longer be approximated by a normal distribution. In this case, we can use the
exact calculation to evaluate the expected value for the time between initiation and promotion.
The expected value for a continuous random variable, in our case the time between initiation and
promotion, t - a, is defined as:

$$E[t - a] = \int (t - a)\,P[t - a]\,d(t - a)$$

We evaluate the expected time between initiation and promotion to be:
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A 



a 



1 - e 



la-pj 



1 . «c_-f^'^ ,,(«-PXt-a) 
a 



In 2 



• (2 - 1) 



0 



1 ^(t-a) 



ta-|3J 



1 - C^c-Pc 
In 2 o^c 



a 



1 . c^c-Pc i 



In 2 



^ 

' (a-ji) In 2 

where Ei is the exponential integral function. A computational tool such as Mathematica™
(ExpIntegralEi function) or Matlab™ (expint function) can be used to evaluate the exponential
integral function.

The average promotion mutation rates, r_A (m>1)

When a cell in an adenoma acquires the first promotional mutation, this cell has the potential
to divide and become a distinct colony of cells now containing one promotional mutation. A cell
within this colony is a target for a second promotional event, producing a new colony within the
adenoma, now made up of cells with two promotional mutations. This process continues until a
cell acquires all necessary 'm' promotional mutations, thereby producing a carcinoma cell.
(Figure 21) As was the case for a newly initiated cell, we must consider the possibility that any
cell that acquires a new promotional mutation could undergo stochastic extinction before
developing a colony.

The adenoma itself would thus appear to be a mix of colonies of cells containing zero or
more of the promotional events, and the delay in the rise of the mortality curves, Δ_h, is now the
sum of the average time between each promotional event.



[Figure 16. Diagram of promotion for necessary events. Sequence shown: initiated cell; first promotional event at age t_1; second promotional event at age t_2; ...; (m-1)th promotional event; mth promotional event at age t (promoted cell); cancer. Each stage is labeled with stochastic survival, a growth rate (α_i - β_i) and a mutation rate.]

From Figure 16, the probability of promotion at age 't' simply follows as:

Probability of 1st promotion mutation by age t_1 (a < t_1 < t) x
Probability of 2nd promotion mutation by age t_2 (t_1 < t_2 < t) x
... x
Probability of (m-1)th promotion mutation by age t_{m-1} (t_{m-2} < t_{m-1} < t) x
Probability of mth promotion mutation at age t



Explicitly, this is:
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(j-mr 



A 



d(\ - e 



ex. 



2 (aj-PiXti-a) 



1 



In 2 



2 "^2 

a. 



) 



t (-(m-l)r_ 



a 



(t, - a; 
2- (a,-(3,Xt,-t,) 
■ 2 ~ ' ' 



1 



In 2 



3 ^3 



) 



J ^ILU 



2 (^..rP.-i)(^m-r^m-2) 



a 



- 1 cc^- 



In 2 



.2 («™-PmX^.-r^..-2) 



l.'^c-Pc) 




^^1 ^'2^'^3---^^m-l K^twp-n initiation and death, the second 

where the urst promotion mutation occurs at any age. t,. between mu:at on 

..rnrs at any age. to. between the first mutation and deatn. the tnira 
promotion mutation occurs at any age. o. 

W= -na. supp.=a *a> ac,..Uo„ . a ne. P— ^^^^^ 

estimate each cell Vunetic ra;u f ■ . average cell 

.iSh. imagine tha, *=rc exi.u an "-^;^-;"^°'":,rcr„ 

the simpler form: 
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dd 



(-mr 



a 
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i (-{'"-Or. • — ■ 



(a-PXt -a) 



In 2 



) 



(a-B)(t -t ) 
2 - 1 



In 2 



) 



<i(t2 - tj) 



J 



^(1 



(_2. . _^ . i ^ li) 



In 2 



m-2 



, (a-S)(t-t 
a f 2 ' "^-^ - 1 . cx.-P,) 



In 2 



a. 



We can now explicitly estimate the average promotion mutation rate with respect to the
expected time between each pair of consecutive events. Here, we solve for the average
interarrival time between each promotional event. Again we use the approximation that the
average time approximately corresponds with a cumulative probability for that promotion
mutation of 0.5:

The total expected delay between initiation and promotion, Δ_h, is simply the sum of the delays
between each promotion mutation. For m>1:



m 
I 
i=2 



ti 1 



] -r 



/ r 



a 



] 



a 



[in(2)] 



c 



This gives us an equation for the delay in the general case of 'm' promotion mutations, assuming
that the order in which the promotion mutations occur is inconsequential.

If we instead suppose that each of the promotion mutations leads to either an elevated cell
growth rate or an elevated mutation rate per cell year, then there would exist a particular order
for the 'm' necessary promotion mutations that would be favorable for the early promotion of the
tumor. We might suppose that if the particular deleterious mutations do not occur early in the
order of the 'm' mutations, the individual would not accumulate all of the necessary promotion
mutations within their lifetime. Using the same logic as above, we can explicitly evaluate the
delay between initiation and promotion, if the 'm' mutations had to occur in a particular order:





f 


a 1 


-1^ 


(Tn-l).log2 


\ + 


.'-^ ■ «-P [ln(2)f. 





log. 



g 



[ln(2)] 



These equations allow us to estimate the average promotion mutation rate for the case
where the 'm' promotion mutations occur in a completely unordered manner and the case where
the 'm' promotion mutations must follow in a particular order. Of course, it is possible that only
some of the promotion mutations must occur in order. For this case, where the 'm' mutations are
only partially ordered, these equations would describe the possible range for the average
promotion mutation rate.

Example 3 IDENTIFICATION OF INHERITED POINT MUTATIONS

Background and significance

This section reviews competing technological approaches to the identification of
point mutations (e.g., single nucleotide polymorphisms) and rare point mutations carried
by human populations. "SNP" implies the widely-accepted definition of inherited point
mutations present in 1% or more of the population (Brookes, 1999). By 'rare point
mutations' we mean all point mutations present at allelic fractions less than 1%. We
have previously argued on theoretical grounds that point mutations at fractions lower
than 1% would need to be discovered in order to identify genes carrying alleles that are
deleterious for reproductive fitness or somatically harmful in adult humans
(Tomita-Mitchell et al., 1999). Based on earlier observations by population geneticists,
we believe it is both useful and necessary to scan large samples of cross-sections of the
ethnic groups which constitute the American population.

Significance of inherited human mutations.

Discovery of deleterious alleles.

As a general proposition, a true fine structure map of inherited human point
mutations would allow classification of each gene mapped with regard to the presence or
absence of dominant or recessive deleterious obligatory knockout alleles. At this writing
obligatory knockout alleles are limited to stop codons and frameshift mutations. But the
growing knowledge of gene-inactivating splice site mutations should allow their use as
obligatory knockouts also. Similarly, increased knowledge of protein structural motifs
which inactivate gene product function should allow certain missense mutations to be
recognized as probable, if not obligatory, knockout mutations in the near future. These
data would also lay the groundwork for determining the fraction of the total human
genomic complement which carries such deleterious alleles.

Studies in human reproduction indicate that about 0.75 of all human conceptions
are lost prior to birth, many in pre-implantation or early post-implantation losses
unrecognized by the mother. About 0.30 are attributable to aneuploidy or chromosomal
aberrations. Thus about 0.45 of all human conceptions are lost due to unknown causes
(Liber and Thilly, 1983). We have explored the theoretical possibility that all or a
significant fraction of these fetal losses could be due to dominant deleterious alleles or
homozygosity for recessive deleterious alleles.

Dominant deleterious alleles appear to arise at a fraction of 3 x 10^-5 per gene at
risk (Cavalli-Sforza, 1971). It would thus require (0.45 / 3 x 10^-5) = 15,000 of such
dominant deleterious alleles to account for all of a fetal wastage fraction of 0.45.
Recessive deleterious alleles are expected to be carried by 1.33% of the population, on
average, for each gene at risk, given a forward mutation rate for gene loss of 3 x 10*^ (the
Hardy-Weinberg equilibrium calculation). This estimates that the number of such
recessive deleterious allele-carrying genes would be (4 x 0.45)/(0.0133)^2 = 10,130 to
account for all of a fetal wastage fraction of 0.45.

It is not unreasonable to consider the possibility that all, or a large fraction, of
actual fetal losses arise from a combination of fewer than 10,000 recessive and fewer
than 15,000 dominant deleterious allele-carrying genes in addition to the fraction caused
by aneuploidy and/or chromosome aberrations well described by cytogeneticists.

Algebraically, the numbers of genes carrying either recessive, NR, or dominant, ND,
deleterious alleles are related to this estimated upper bound on fetal wastage:

3 x 10^-5 ND + 4.4 x 10^-5 NR = 0.45
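
The two upper bounds can be reproduced with a few lines of arithmetic. The sketch takes the dominant fraction as 3 x 10^-5 per gene, which is the value implied by 0.45 / 15,000, and computes the per-gene recessive loss as the probability that both parents are carriers (0.0133 squared) times the one-in-four chance of a homozygous conception. Variable names are illustrative.

```python
UNEXPLAINED_LOSS = 0.45             # fraction of conceptions lost to unknown causes
DOMINANT_FRACTION_PER_GENE = 3e-5   # implied by 0.45 / 15,000
CARRIER_FRACTION = 0.0133           # carriers of a recessive deleterious allele, per gene

# Both parents carriers, and one conception in four homozygous.
recessive_loss_per_gene = CARRIER_FRACTION ** 2 / 4.0

n_dominant_max = UNEXPLAINED_LOSS / DOMINANT_FRACTION_PER_GENE
n_recessive_max = UNEXPLAINED_LOSS / recessive_loss_per_gene

print(recessive_loss_per_gene)
print(n_dominant_max)
print(n_recessive_max)
```

The per-gene recessive loss comes out close to the coefficient 4.4 x 10^-5 in the equation above, and the recessive bound lands near 10,000 genes. The exact figure depends on how the 1.33% carrier fraction is rounded.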
Of course, the attribution of all fetal wastage to inherited deleterious alleles
provides an upper estimate on their numbers. These estimates are, however, related to
estimates of the number of essential genes in yeast, zebra fish and mice, which range
from 5000 to 15,000 at present count.

The clinical importance of these considerations extends to consideration of
asymptomatic infertility. It is possible that a significant fraction of infertile couples with
no clinical indication of reproductive dysfunction may be found to carry multiple
conditions of complementing heterozygosity. For instance, if a couple were
heterozygous for two identical recessive deleterious alleles, 7/16 of their conceptions
would be lethal. If heterozygous for three identical recessive alleles, 37/64 of their
conceptions would be lethal, and so on.
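
These fractions follow from a simple rule, sketched here under two assumptions that the text leaves implicit: the loci assort independently, and every conception homozygous at any one locus is lost. Each shared locus is then survived with probability 3/4, so the lethal fraction for k shared loci is 1 - (3/4)^k. The function name is hypothetical.

```python
from fractions import Fraction


def lethal_fraction(shared_loci):
    """Fraction of conceptions homozygous at one or more shared loci.

    Assumes both parents are heterozygous at each locus, independent
    assortment, and full lethality of the homozygous state.
    """
    viable_per_locus = Fraction(3, 4)
    return 1 - viable_per_locus ** shared_loci


for k in (1, 2, 3):
    print(k, lethal_fraction(k))
```

For two shared loci this gives 7/16, and for three it gives 37/64.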

This "Fermi calculation" is of practical value in that it suggests that the scanning
of all expressed genes in the human genome would uncover fewer than 10,000 genes
which carry recessive deleterious alleles and fewer than 15,000 carrying dominant
deleterious alleles. Assuming 50,000 such expressed human genes, this works out to
expected fractions of less than 20% and 30% respectively. For a proposed random set of
15 genes we might thus expect 3 or fewer to show a set of obligatory knockout alleles
with fractions summing to about 0.3%, permitting the inference that such a gene carries
recessive deleterious alleles. Similarly we expect 5 or more to show no obligatory
knockout alleles, permitting the inference that such a gene carries dominant deleterious
alleles. If we assume 100,000 random genes, these estimates are lower accordingly.

But we have not chosen a random set of genes. We propose to examine 13 genes whose
gene products have been identified as constituents of the nuclear DNA repair complexes.
Since obligatory knockouts of such genes are expected possibly to be dominant
deleterious and certainly to be recessive deleterious alleles, we expect some of these
genes to carry obligatory knockouts summing to 0.5% or less and some to show no such
alleles down to our limits of detection (see Table 8, Studies).

Discovery of somatically harmful alleles.

It follows that such a fine structure map drawn from juvenile populations would
contain somatically harmful alleles such as those which might increase somatic mutation
rates and hasten the appearance of cancer or atherosclerosis. Studies to discover such
alleles would involve additional analyses of proband populations for specific diseases or
extremely aged populations in which carriers have been reduced by early mortality.
However, the proposed study in juvenile populations would be a necessary
first step in such explorations.

Discovery of alleles specific to ethnic subpopulations.

It is clear that inherited polymorphisms at fractions greater than 1% vary among
the major demographic groups of the world and it may be reasonably expected that such
important variations will exist for deleterious alleles present at fractions less than 1%.
We have already found such an example in the HPRT gene exon 6. Our initial emphasis
on a large ethnically mixed sample of 50,000 juveniles will allow us to identify any
allele present down to an allele fraction of about 5 x 10^-5. This strategy is designed to
increase the number of obligatory knockout alleles in the initial sample and to lay the
groundwork for future studies aimed at particular demographic groups.

Technological background: high resolution mutational spectrometry.

The approach described herein represents a clear and significant improvement
over present and even proposed strategies. The method described herein is extremely
efficient for mutation discovery and frequency estimation, because it can analyze pooled
genomic DNA from a hundred thousand or even a million individuals. This is in contrast
to other presently employed techniques which require assaying each individual, or small
pools of at most about a dozen individuals (Trulzsch et al., 1999). Since the cost per
individual is not trivial, this makes mutation discovery in large populations very
expensive, and these studies are required for determination of low-frequency point
mutations with useful statistical precision (Hagmann, 1999).

There are two alternative approaches to the use of large pooled samples analyzed
by high resolution mutational spectrometry as proposed herein.

The first is the sequencing of megabases of genomic DNA from a limited
number of individuals. This is the "resequencing" strategy employed by a number of
private companies and various NIH or DOE sponsored university laboratories and
human genome centers. This approach was made possible by construction of research
facilities to sequence the human genome. High throughput was naturally defined as the
length of DNA sequences that could be sequenced per year. The cost per base pair of this
approach is presently estimated to be about 20 cents. But we and industry analysts
estimate that reasonable improvements in processing samples and an increase in scale to
analyze thousands of donor samples would reduce these costs to about 2 cents per base
pair.

It would thus cost some five hundred million dollars at present cost levels
charged to NIH grants to examine 2.5 billion base pairs. The method described herein
will cost less than three million dollars. Furthermore we focus on exons and splice sites
in which most deleterious mutations are expected. In genome resequencing these
sequences rarely comprise more than 5% of the length of DNA sequenced, increasing the
relative cost of identifying inherited deleterious alleles in any chosen gene.
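
The cost comparison follows directly from the per-base-pair figures quoted above. The sketch below simply reproduces that multiplication; the function name is hypothetical, and integer cents are used to avoid floating-point rounding.

```python
def resequencing_cost_dollars(base_pairs, cents_per_base_pair):
    """Total resequencing cost in whole dollars for a given per-base-pair price."""
    return base_pairs * cents_per_base_pair // 100


BASE_PAIRS = 2_500_000_000  # 2.5 billion base pairs, as stated in the text

present_cost = resequencing_cost_dollars(BASE_PAIRS, 20)    # about 20 cents per bp
projected_cost = resequencing_cost_dollars(BASE_PAIRS, 2)   # about 2 cents per bp

print(f"present cost:   ${present_cost:,}")
print(f"projected cost: ${projected_cost:,}")
```

Even at the projected 2 cents per base pair, the resequencing figure remains well above the three million dollar ceiling stated for the method described herein.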

Unfortunately, the use of even a thousand blood samples will not reveal recessive
deleterious mutations such as many of those causing cystic fibrosis and a suspected
several thousand forms of fetal loss. This is because the sum of fractions of such
deleterious alleles is expected to be about 1.33% since individual alleles are expected at
fractions ranging from 0.3% to 0.003%. Examination of 100,000 alleles should give a
worthwhile picture of these alleles and our first studies of a few thousand individuals
offer support for this expectation (Tomita-Mitchell et al., 1999). Since the cost per
individual is not trivial, the usual methods of mutation discovery in large populations are
very expensive. We argue that our studies are required for determination of
low-frequency point mutations with useful statistical precision (Hagmann, 1999).

The second alternative technology (to the use of high resolution mutational
spectrometry on large pooled samples) is the use of microarrays of short DNA sequences
(DNA chips) created to detect any and all point mutations in a given DNA sequence.
Theoretically these could focus on a sample of a 100 bp sequence to detect any of the
300 single base pair substitutions and 200 single base pair additions or deletions.
Practically the use of up to 16-mers has indicated that only somewhat greater than half
of all point mutations are detected in this way and that the detection of variants at
fractions lower than 10% is not yet feasible using this technology (Paul Berg, Stanford
University, personal communication). Given the cost of each "chip", the creation
of a chip for each sequence to be scanned, and the need to prepare and analyze blood cell
DNA samples for each person, we do not think that the task of identifying deleterious
inherited point mutations over a large number of genes and people can be accomplished
in this way without a truly massive investment.

In short, given a means to accurately measure the fractions at the desired
sensitivity, the use of large pooled samples appears to be a good way to determine the
type and numerical distribution of inherited point mutations in human populations. Our
proposed approach using constant denaturant capillary electrophoresis (CDCE) to
separate mutant from wild type sequences prior to DNA amplification has already been
demonstrated to have the requisite sensitivity. Furthermore, when target sequence sizes
of 80-120 base pairs are used, 100% of all point mutants studied have been separated as
mutant/wild type heteroduplexes from wild type homoduplexes (Mickey et al., 1995;
Khrapko et al., 1998).

One should note that separation of wild-type homoduplexes from mutant-containing
heteroduplexes, which we perform using constant denaturant capillary electrophoresis,
may also be accomplished by HPLC under similarly predetermined temperature
conditions. The physical-chemical principles of mutant separation are similar to those in
CDCE: partially melted heteroduplex molecules containing a mutant sequence are
retarded in columns relative to wild type homoduplexes. Alternatively, many, but not
all, single nucleotide changes can also be detected using single strand conformation
polymorphism velocity variations in capillary electrophoresis (CE-SSCP) (Gonen et al., 1999).

The mutant separation method we propose uses constant denaturant capillary
electrophoresis (CDCE), as employed to observe somatically-derived point mutation
spectra in human tissues (Khrapko et al., 1994; Khrapko et al., 1997; Li-Sucholeiki et al.,
1999). The method coupled with high fidelity DNA amplification has been demonstrated
to detect mutations in 100 bp sequences with a sensitivity of at least 10^-6 in samples from
several human organs (Khrapko et al., 1997, 1998).

CDCE is based on mobility differences among partially denatured
double-stranded DNA fragments. These mobility differences arise from differential
cooperative melting behavior among wild type homoduplexes and various
heteroduplexes formed between the majority wild type sequences and minority mutant
sequences in a complex sample such as the pooled blood samples proposed herein
(Thilly, 1985). The high resolution achieved by capillary electrophoresis, as well as
elimination of artifacts arising from interaction of DNA with standard cross-linked
polyacrylamide gels, resulted in a means to detect and sequence rare mutants in complex
samples (Khrapko et al., 1994; Khrapko et al., 1997; Salas-Solano et al., 1998; Muller et
al., 1998; Pennisi, 1999).

Automation. Capillary separation systems offer the advantage of complete
automation of sample loading, separations, fraction collection and subsequent DNA
amplification of selected collections. Automated sample injection and replacement of
polymer separation matrices is being used in capillary array instruments to conduct the
large-scale sequencing required to determine the human genome (Mullikin and
McMurray, 1999). In this proposal we introduce our already-developed peak, or
sample, collection system coupled to automated PCR.

Role of rad54 in homologous recombination. 

One long-term goal is to determine whether the suspected dominant and
recessive deleterious and/or harmful alleles uncovered by the research actually impart a
mutator phenotype to a human cell. Our approach is to create heterozygous and/or
homozygous cell lines at the genetic loci identified to be potential human mutator alleles
and evaluate their spontaneous mutation and mitotic recombination rates at the HPRT
and TK loci.

An immediate goal is to evaluate the relative efficiencies of two protocols for 
gene replacement, which can be used in later experiments. The genetic locus we have 
chosen to work with is rad54, which appears to play a central role in homologous 
recombination and repair of DNA double strand breaks. We hypothesize that mutations 
in rad54 will result in the cell having a higher spontaneous mutation rate and becoming 
hypermutable to agents such as ionizing radiation. 

In S. cerevisiae, there are at least 9 different proteins in the rad52 epistasis group,
which is involved in homologous recombination, as it affects the repair of double strand
breaks (Haber, 1995; Bai and Symington, 1996; Baumann and West, 1998).
Mammalian homologues have been identified for several of these, including rad54
(Kanaar and Hoeijmakers, 1997). In the mouse, disruption of rad54 leads to increased
radiation sensitivity and decreased homologous recombination (Essers et al., 1997).
From these experiments, we learn that rad54 is not an essential gene; this is important as
it indicates that a knockout in human lymphoblast cells should not be lethal. The human
homologue is a functional one, as it complements a repair-deficient rad54 yeast mutant
(Kanaar et al., 1996). The protein is a double-stranded DNA dependent ATPase, but its
precise function in recombination is still undetermined (Swagemakers et al., 1998).

There is evidence that rad54 alterations may be involved in carcinogenesis. Out of 132
primary tumors examined, several mutations in rad54 have been identified; these include
a GGA -> AGA (Gly -> Arg) in a breast tumor, CCT -> CAT (Pro -> His) in a colon cancer,
and GTG -> GAG (Val -> Glu) at codon 444 in a lymphoma (Matsuda et al., 1999). The
mutation in the colon cancer was clearly somatic in origin, as the surrounding tissue had
the wild type sequence. The observation that a rad54 mutation occurred in 1/24
lymphomas examined has an important implication for our proposed study, as it
provides relevance for our work in lymphoblastoid cells.

The significance of the proposed work is that we will evaluate the effect of point
mutations and rare point mutations actually found in the human populations sampled on
mutation states in human cells. This emphasis on phenotype evaluation and focus on
DNA repair genes has been made in response to the reviewers' criticisms.

STUDIES 

The studies proposed herein require a sensitivity of only 5 x 10^-5 to detect 5 or
more inherited point mutants among 100,000 alleles. This sensitivity is possible because
we use Pfu DNA polymerase under conditions in which PCR error is observed to be
about 6 x 10^-7 errors per base incorporation and because we limit our initial PCR
amplification to 5 doublings (Andre et al., 1997). This limits the background noise due
to PCR to about 6 x 10^-7 x 5 = 3 x 10^-6 mutations per base pair prior to separation of
mutant from wild-type sequences. Detection on the capillary is accomplished using
laser-induced fluorescence from a fluorescein-labeled PCR primer that also serves as a high
melting temperature "clamp" necessary for CDCE separation. Once we could detect low
fractions of point mutants in simple reconstruction experiments we had to work out a
method to process real human cells and tissues. Because mutant fractions were so low
we had to start with cell numbers in excess of 10^9 or more than 6 mg of genomic DNA.
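
The relation between PCR background and required sensitivity can be illustrated with a short calculation. This is a sketch only: the per-base error rate of 6 x 10^-7 per doubling is taken here as an assumed value for illustration, background is assumed to accumulate linearly with the number of doublings, and both function names are hypothetical.

```python
def pcr_background(error_rate_per_base_per_doubling, doublings):
    """Approximate mutant fraction per base pair introduced by PCR (linear accumulation)."""
    return error_rate_per_base_per_doubling * doublings


def required_sensitivity(min_mutant_copies, alleles):
    """Smallest mutant fraction that must be detectable in the pooled sample."""
    return min_mutant_copies / alleles


# Assumed error rate (illustrative) and the 5 doublings stated in the text.
noise = pcr_background(6e-7, 5)
# 5 or more mutant copies among 100,000 alleles, as stated in the text.
needed = required_sensitivity(5, 100_000)

print(f"PCR background per base pair: {noise:.1e}")
print(f"required sensitivity:         {needed:.1e}")
print(f"background below requirement: {noise < needed}")
```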

The procedure we developed involves isolating the total DNA (up to 10 mg) by
simply digesting with proteinase K, SDS and RNase followed by centrifugation and
ethanol precipitation. This gives us >90% yield in a protocol with iterative extractions.
We digest this mass of DNA with a pair of restriction enzymes, which liberates a
fragment containing our desired sequences. We hybridize with an excess of
biotin-labeled probes to both the Watson and Crick strands of the desired sequences.
This allows us to capture the hybrids from the bulk DNA using streptavidin-coated glass
beads with a particularly low affinity for non-specific DNA binding. These are then
amplified using hifi PCR. A very important characteristic of this method is that it
permits us to remove many thousands of DNA sequences from the same pooled samples
since DNA removed from the bead washings is returned to the original sample.
We recover more than 70% of our original copy number after elution from the
streptavidin-coated beads. The enrichment for a single copy nuclear gene was 10,000
fold as determined by comparison to carryover of multicopy mitochondrial sequences.
The sequence we used for this development was a 255-bp sequence of the human APC
gene, cDNA bp 8429-8683. It has served as an example of a nuclear sequence with
juxtaposed high and low temperature isomelting domains. Mutations are detected in a
104-bp sequence between bp 8560 and 8663 in the low melting domain. Our current
detection limit of these APC gene mutations in human cells and tissue samples is 10^-6
(Figs. 22A and 22B).

There are optimized procedures for controlled polymerization of linear
polyacrylamide (Goetzinger et al., 1998). Using emulsion polymerization, linear
polyacrylamide (LPA) of defined chain length can be tailored for any type of DNA
separation, and stored indefinitely in powdered form. For example, long chains (10 to 17
MDa or greater) alone and in combination with short chains (50 to 250 kDa) are
advantageous for long DNA sequencing read lengths (Salas-Solano et al., 1998; Zhou et
al., 1999; Carrilho et al., 1996), while a medium size (~1 MDa) polymer is suitable for
the separation of short (100-500 bp) dsDNA fragments (Berka et al., 1995). The
capability to produce a variety of molecular masses of LPA will allow optimized
compositions for the variety of tasks to be undertaken (e.g., PCR product analysis,
CDCE, sequencing).



Mitochondrial Mutations. 

We have also measured mutational spectra in the mitochondrial sequence bp
10,030 to 10,130 in multiple tissues, tumors, and human cells in culture. This task was
substantially easier than the nuclear DNA studies described above because there were
about 400 rather than two gene copies per cell and initial isolation of the desired
sequence was not required. But these studies gave us substantial experience in handling
DNA from human blood and organs. We discovered the same set of 17 mitochondrial
hotspot mutations in all samples, indicating a universal process of human mitochondrial
point mutation. The mutations observed consisted primarily of both kinds of transition
mutations with two transversions. A sensitivity limit of 10^-6 was required for these
observations. Mutations were demonstrated in both "Watson and Crick" strands, a
necessary step to control for mismatch intermediates or DNA damage being mistaken for
mutations (Khrapko et al., 1997; Coller et al., 1998). This research demonstrated our
ability to determine rare mutations in which as many as 20 separate mutations were
found in a single 100-bp target sequence. These studies establish the fact that point
mutations in the human genome can be and have been detected, isolated and sequenced
down to a frequency of 10^-6.

Study of 1000 pooled blood samples 

We chose three approximately 100 bp sequences in the large exon 15 of the APC
gene, one from exon 8 of the TP53 gene, six from exons 2, 3, 5, 6, 8 and 9 of the HPRT
gene and one mitochondrial DNA sequence. We defined conditions for high fidelity
PCR and CDCE separation. We also created internal standards for each sequence.

In our first trial, we collected two separate groups of blood samples, each from 1000
juveniles, from the Boston Lead Laboratory. We created two separate pooled samples
from each separate juvenile set. These four pooled samples, two independently
constructed duplicate pooled samples from each of the two separate juvenile sets, were
assayed for the ten nuclear DNA sequences and one mitochondrial DNA sequence. The
results are summarized in Table 7.
Of the nuclear sequences scanned, six inherited point mutations were discovered
and one was found in the mitochondrial sequence. Two high inherited mutant fractions
were found in a sequence of the APC gene exon 15 (11%) and in the mitochondrial
sequence (18%). Four separate mutations have been found but not yet sequenced in exon
9 of HPRT, each with mutant fractions between 10^-3 and 10^-2. Since a total of 4000
alleles were scanned for the autosomal genes APC and TP53, the absence of any signal
indicated that no inherited mutations were present in these samples at a level of
2.5 x 10^-4 or higher. Since HPRT is X-linked, some 3000 alleles were scanned in these
mixed gender sets. One mutant in exon 6 of HPRT was later traced to persons of
African-American heritage.
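
The detection floors quoted for the autosomal and X-linked sequences follow from the number of alleles scanned: one mutant copy among N alleles corresponds to a fraction of 1/N. The sketch below illustrates this under the assumption that the mixed gender sets were roughly half female (the text says only "some 3000 alleles"); the function name and the `female_fraction` parameter are hypothetical.

```python
def alleles_scanned(individuals, x_linked=False, female_fraction=0.5):
    """Number of gene copies in a pooled sample of the given number of individuals."""
    if x_linked:
        # Females carry two X chromosomes, males carry one.
        return round(individuals * (2 * female_fraction + (1 - female_fraction)))
    return 2 * individuals


autosomal_alleles = alleles_scanned(2000)                  # two sets of 1000 juveniles
x_linked_alleles = alleles_scanned(2000, x_linked=True)    # assumed equal gender mix

autosomal_floor = 1 / autosomal_alleles
x_linked_floor = 1 / x_linked_alleles

print(f"autosomal: {autosomal_alleles} alleles, floor {autosomal_floor:.1e}")
print(f"X-linked:  {x_linked_alleles} alleles, floor {x_linked_floor:.1e}")
```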



Table 7. Results of scanning ten nuclear and one mitochondrial ~100 bp sequences for
inherited point mutations by CDCE/hifi PCR in two separate pooled blood samples each
derived from 1000 juveniles.

Gene | Scanned sequence (position) | Allelic fraction
APC exon 15 | 121 bp (cDNA bp 8543-8663) | GC -> TA at cDNA bp 8643 (11%)
APC exon 15 | 113 bp (cDNA bp 3876-3988) | None (>2.5 x 10^-4)
APC exon 15 | 109 bp (cDNA bp 1332-1440) | None (>2.5 x 10^-4)
TP53 exon 8 | 131 bp (bp 14461-14591) | None (>2.5 x 10^-4)
HPRT exon 2 | 72 bp (cDNA bp 34-105) | None (>3.3 x 10^-4)
HPRT exon 3 | 80 bp (cDNA bp 135-214) | None (>3.3 x 10^-4)
HPRT exon 5 | 66 bp (cDNA bp 38?-402 + 9 bp 5' intron + ?? bp 3' intron) | None (>3.3 x 10^-4)
HPRT exon 6 | 92 bp (cDNA bp 403-485 + 3 bp 5' intron + 6 bp 3' intron) | GC -> AT at cDNA bp 480 (6 x 10^-?), only in African Americans
HPRT exon 8 | 96 bp (cDNA bp 53?-609 + 23 bp 3' intron) |
HPRT exon 9 | |


We regarded these first observations as important not only for the mutants
observed but for the fraction of 100 bp sequences which did NOT contain inherited point
mutations. 7 of 10 nuclear sequences were vacant with regard to inherited point
mutations at fractions of 3.3 x 10^-4 or higher. This was extremely significant since it
indicated that the discovery of individual recessive deleterious alleles expected to arise
at fractions from about 3 x 10^-3 and lower would not be impeded by the presence of a
large number of non-deleterious alleles.

In Fig. 24 we show the kind of reproducibility observed for all eleven sequences
scanned using a sequence from exon 15 of the APC gene as an example. What is shown
is a CDCE run in which the mutant sequences are run as homoduplexes just prior to peak
isolation for sequencing. By the use of the internal standard introduced into the original
sequence isolate, it was possible to obtain estimates of original mutant fractions.

In our first direct studies of potential ethnic differences with regard to rarer
inherited point mutations we examined almost 1900 alleles of juveniles of
African-American and Hispanic-American origin. These samples came from the New
York Lead Laboratory. As may be seen in Fig. 24 an inherited mutation was found in
the African-American group which was absent in the Hispanic-American group.

Sometimes our raw CDCE data creates confusion and reasonable reviewers want
to see what we really see as we first amplify our genomic samples with fluorescent
labeled clamp sequences as primers. The steps are shown in Fig. 24. First we check to
be sure there is no significant PCR "noise" in the sample by amplifying DNA from the
human TK6 cell through some 30 doublings.

Preliminary data from MIT-Shanghai Cell Biology Institute Consortium.

The observations from the consortium supported the observation that the number
of distinct point mutants in any 100 bp sequence which arose at fractions of either >2.5 x
2Q-4 ^^Q-A nfg^ far betvr^een". This result is extremely important since it 
indicates that both non-deleterious and recessive deleterious alleles should be easily 
recognized in large pooled samples as proposed. 

Pooled samples from 5000 Han Chinese juveniles, or 10,000 alleles for autosomal
genes, were obtained. The sensitivity of the procedure is such that even a single point
mutant sequence would be detected. In Fig. 27 the results for the beta globin gene are
presented as the mutant fractions for the mutations observed plotted at their sequence
positions in the genes. A similar quantitative distribution has been observed for the
larger alpha-1-antichymotrypsin gene.

In a parallel effort some 41 STS were scanned as examples of sequences in
which mutations would not be expected to affect reproductive fitness. These sequences
had already been found by the MIT-Whitehead Institute genome center to contain point
mutations at 25% or higher in the several dozens of individual genomes scanned. Using
the pooled sample approach, an additional 21 point mutations (1% or higher) were
found, but only three mutations in the frequency range from 0.01 to 0.001. These data
extend the observations that the number of rare inherited point mutations is small.

Discover which genes have obligatory knockout alleles at fractions that permit
inferences about the importance of such genes in determining reproductive fitness.
Identify and note the fraction of genes scanned which appear to carry either dominant or
recessive deleterious alleles.

Some reports have expressed the belief that most mutations causing hereditary
diseases are missense mutations. This is certainly true for sickle cell anemia and a
variety of other disorders but our experience with APC and HPRT mutations led us to
think that a significant fraction of recessive or dominant deleterious mutations would be
frameshifts or stop codons. More than 5,000 recorded point mutations were studied and
more than 40% were found to be of the obligatory knockout (OKO) variety. Table 8
summarizes the results for the genes with the most reported individual mutations. It
is clear by inspection that OKOs are a general feature of inherited disease-causing
mutations. Their presence in two or more separate knockout alleles each at fractions
below 1% would permit an inference that the gene in question carries recessive
deleterious alleles.

Table 8. Result of a study of genes and inherited mutations in the Human Genome
Mutation Database to discover what fraction of inherited disease is attributable to
obligatory knockout mutations (R. Wasserkort, MIT).



Gene | Totals of all kinds of mutations | Ratio OKO to all point mutations | Gene name | Disease(s)
APC | 301 | 0.97 | Adenomatous polyposis coli | Adenomatous polyposis coli
BRCA1 | 258 | 0.82 | Breast cancer 1 | Breast cancer, ovarian cancer
CYBB | 182 | 0.65 | Cytochrome b-245, beta-polypeptide | Chronic granulomatous disease
BTK | 210 | 0.64 | Bruton agammaglobulinaemia tyrosine kinase | Agammaglobulinaemia
HBB | 259 | 0.54 | Haemoglobin beta | Thalassaemia beta, Haemoglobin variant
CFTR | 495 | 0.52 | Cystic fibrosis transmembrane conductance regulator | Cystic fibrosis, Congenital absence of vas deferens, Hypertrypsinaemia, low sweat chloride
IDS | 258 | 0.49 | Iduronate 2-sulphatase | Hunter syndrome
LDLR | 359 | 0.45 | Low density lipoprotein receptor | Hypercholesterolaemia
ALD | 123 | 0.43 | Adrenoleukodystrophy | Adrenoleukodystrophy
COL1A1 | ?3 | 0.37 | Collagen I alpha 1 | Osteogenesis imperfecta I-IV, Ehlers-Danlos syndr. VII
TYR | ?3 | 0.36 | Tyrosinase | Albinism, oculocutaneous 1
F9 | 637 | 0.36 | Factor IX | Haemophilia B, Warfarin sensitivity
OTC | 146 | 0.35 | Ornithine carbamoyltransferase | Ornithine transcarbamylase deficiency, Hyperammonaemia
HPRT1 | 104 | 0.35 | Hypoxanthine phosphoribosyltransferase 1 | Hypoxanthine guanine phosphoribosyltr. defic., Lesch-Nyhan syndrome
PAH | 275 | 0.35 | Phenylalanine hydroxylase | Phenylketonuria, Hyperphenylalaninaemia
TP53 | 59 | 0.30 | Tumour protein p53 | Li-Fraumeni syndrome, Adrenocortical carcinoma
PKLR | 119 | 0.28 | Pyruvate kinase (liver and red blood cell) | Pyruvate kinase deficiency, Haemolytic anaemia
RHO | 88 | 0.14 | Rhodopsin | Retinitis pigmentosa, Night blindness
Sum | 4059 | 0.46 (arithm. mean) | |

Anticipated Optimization Studies

The following description of elements of a high throughput laboratory for point
mutation discovery is offered as evidence that the teaching of this application can
reasonably be anticipated to be accomplished for all or a large portion of genes in the
human genome.

CDCE optimization. 

To ensure highest enrichment of mutant heteroduplexes as well as highest
resolution for separation of mutant homoduplexes, CDCE conditions must be optimized
for each target. The optimal CDCE conditions for collecting SNP-containing
heteroduplexes should allow all of the heteroduplexes to coalesce into a single fraction
well separated from the wild type homoduplex, while the optimal conditions for
isolating SNP-containing homoduplexes should provide the greatest resolution among
the observed homoduplex peaks. The PCR products amplified under the optimized
conditions will be used as the test samples for CDCE optimization. The optimization
will be performed using the same 24-capillary array CDCE instrument. Using the
independent temperature control on each column, an 11°C temperature range in
increments of 1°C which covers (Tm - 6°C) to (Tm + 5°C) will be tested in the first run.
The optimum temperature will then be refined by a second multi-capillary run using
temperature increments of 0.2°C. For those target sequences for which a high resolution
separation cannot be achieved by varying the CDCE temperature, lower electric field
strengths and/or a linear polyacrylamide matrix with increased salt concentrations will
be used to improve the resolution (Khrapko et al., 1996).
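
The two-stage temperature search can be sketched as a coarse sweep followed by a fine sweep. The example below is illustrative only: it assumes the first run spans Tm - 6°C to Tm + 5°C in 1°C steps, and it assumes the second run covers 1°C either side of the best coarse temperature in 0.2°C steps (the width of the second sweep is not stated in the text). The Tm value, the chosen optimum and both function names are hypothetical.

```python
def coarse_grid(tm, below=6.0, above=5.0, step=1.0):
    """Column temperatures for the first run, from tm - below to tm + above."""
    count = int(round((below + above) / step))
    return [round(tm - below + i * step, 2) for i in range(count + 1)]


def fine_grid(best, half_width=1.0, step=0.2):
    """Column temperatures for the refinement run around the best coarse value."""
    count = int(round(2 * half_width / step))
    return [round(best - half_width + i * step, 2) for i in range(count + 1)]


first_run = coarse_grid(70.0)    # hypothetical Tm of 70.0 degrees C
second_run = fine_grid(68.0)     # hypothetical optimum found in the first run

print(first_run)
print(second_run)
```

Under these assumptions the first run uses 12 of the 24 columns and the second run uses 11, so each sweep fits in a single multi-capillary run.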

Evaluation of the data to select the optimum temperature will be automated by
software to make the required measurements of resolution and peak shape.
PCR and CDCE optimization will be performed at MIT. The optimized conditions for
each target sequence will be used for enrichment and isolation of SNPs and rare SNPs at
the Karger laboratory.



Example 4 RESEARCH DESIGN AND METHODS

Enrichment of mutants.

The target sequences amplified from the pooled genomic DNA sample by hifi
PCR will be run on CDCE to separate the fast-migrating wild type homoduplex from the
slower-migrating mutant heteroduplexes at the optimal temperature. Point
mutation-containing mutant heteroduplexes will be collected in a single fraction for each
column and thus enriched relative to the wild type sequence.

Quantitative enrichment of the mutant fraction in pooled genomic DNA is key to
the ability of CDCE/hifi PCR to detect low-frequency mutations (Khrapko et al., 1997a).
We will first quantify the copy number of the target sequence enriched from the pooled
DNA sample containing 100,000 alleles by competitive PCR with the same two artificial
mutants used for PCR and CDCE optimization (Khrapko et al., 1997a). Based on the
target copy number measured, the two mutants will be added at a mutant fraction of
0.05% to serve as internal standards. The mixed sample will be amplified 6 doublings
(to minimize PCR noise), boiled and reannealed to form heteroduplexes, and run on the
CDCE instrument to enrich the heteroduplexes.
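
The amount of each internal standard to add follows from the measured target copy number. A minimal sketch, assuming for illustration that the measured copy number equals the 100,000 alleles in the pool (in practice it is determined by competitive PCR and will differ); the function names are hypothetical.

```python
def standard_copies(target_copies, mutant_fraction=0.0005):
    """Copies of one internal-standard mutant to add for a given mutant fraction (0.05%)."""
    return target_copies * mutant_fraction


def fold_amplification(doublings):
    """Ideal fold increase in copy number after the given number of PCR doublings."""
    return 2 ** doublings


copies_per_standard = standard_copies(100_000)   # illustrative copy number
fold = fold_amplification(6)                     # 6 doublings, as stated in the text

print(f"copies of each standard to add: {copies_per_standard:.0f}")
print(f"fold amplification after 6 doublings: {fold}")
```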
In this case, two rows of a standard 96-well plate will be used for collection.
During separation, the signal in each capillary will be monitored to detect the exit time
of the wild type DNA from the capillaries. This detection will be performed
automatically, without need for manual attention. At the moment the program
determines that the broadened tail section of the wild-type peak in a particular capillary
has reached a given fraction of the peak maximum, the electric current in that capillary
will be interrupted using computer controlled relays. Once the current to all capillaries
has been stopped (zones will elute at different times in different capillaries and therefore
the time of stoppage will differ for each of the columns), the capillary ends will be
moved into the next row of the wells on the collection plate, and all the DNA fragments
behind the wild type zone will be collected. Diffusion of the DNA from the capillary
during current stoppage is not a concern because this step will be fast (in seconds), and
diffusion is minimized by the viscous sieving matrix.
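
The stopping rule for each capillary can be illustrated with a short routine that tracks the running peak maximum and reports the first sample at which the tail has fallen to a chosen fraction of that maximum. This is a sketch under stated assumptions, not the instrument software: the `tail_fraction` and `min_peak` values, the sample trace and the function name are all hypothetical.

```python
def find_stop_index(trace, tail_fraction=0.05, min_peak=10.0):
    """Index at which the wild-type tail drops to tail_fraction of the peak maximum.

    min_peak guards against triggering on baseline noise before the peak arrives.
    Returns None if the condition is never met.
    """
    peak = 0.0
    for index, signal in enumerate(trace):
        if signal > peak:
            peak = signal
        elif peak >= min_peak and signal <= tail_fraction * peak:
            return index
    return None


# Hypothetical fluorescence trace: small baseline blip, then the wild-type peak.
trace = [0, 1, 0.5, 20, 80, 100, 60, 20, 8, 4, 2, 1]
stop_at = find_stop_index(trace)
print(f"interrupt current at sample index {stop_at}")
```

Running one such check per capillary yields a different stop time for each column, consistent with zones eluting at different times.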

Several iterations of hifi PCR and CDCE may be required to enrich and amplify
the mutants sufficiently for the next stage, which is isolation of individual mutants. The
number of iterations will be determined by whether the peaks corresponding to the
internal standards are visible in the CDCE data trace at a signal-to-noise ratio above a
predefined threshold. To aid in automation, the decision of sufficient enrichment will be
made by software based on the ABC expert system that was developed in the Karger
laboratory (Northeastern University, Boston, MA) as a base caller for DNA sequencing.
Sample handling will be done with a robotic workstation.

Isolation of individual mutants.

After heteroduplex enrichment, mutants must be isolated, quantitated, and
sequenced. The combined mutant fraction from the previous step will be amplified using
only a few cycles of PCR, so that excess primers will still be present in the final
reactions, preventing the formation of heteroduplexes. The resultant mutant
homoduplexes will then be isolated by CDCE with fraction collection into the multiwell
gel plate (again at the previously determined optimum temperature). As mentioned
previously, the location of the fractions on the collection plate will be automatically
correlated with the detection signal from the LIF detector. The frequency of a mutant in
the original sample will be measured by comparing the peak area of the mutant with the
peak areas of the internal standards added in the first CDCE step. Despite the high
resolving power of CDCE, there will undoubtedly be samples where more than one
mutant homoduplex will comigrate. Large homoduplex peaks will be tested for buried
peaks by melting and reannealing followed by CDCE. The presence of multiple peaks
under CDCE will reveal any previously hidden mutant.
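
The peak-area comparison described above can be written as a short calculation. The sketch below assumes a linear detector response and equal amplification of mutant and standard sequences; the function name and the example areas are hypothetical, and the standard fraction of 5 x 10^-5 is taken from the internal-standard description elsewhere in this proposal.

```python
def mutant_fraction(mutant_area, standard_areas, standard_fraction):
    """Estimate a mutant's original allelic fraction from CDCE peak areas.

    The mutant peak area is scaled by the mean area of the internal-standard
    peaks, which were added at a known mutant fraction.
    """
    if not standard_areas:
        raise ValueError("at least one internal standard peak is required")
    mean_standard_area = sum(standard_areas) / len(standard_areas)
    return standard_fraction * mutant_area / mean_standard_area


# A mutant peak twice as large as the mean standard peak.
estimate = mutant_fraction(200.0, [90.0, 110.0], 5e-5)
```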

In the CDCE multicapillary operation, manipulation of the collected fractions
will be performed with the robot, and the separation matrix will be replaced after each
run. Since both the injection and collection ends of the capillary array must be free, the
standard mode of the matrix replacement in DNA sequencing (port permanently
connected to the detection end of the capillaries) cannot be applied. At the start of the
research project, the matrix will be manually replaced by a syringe equipped with a
Teflon joint. At a later stage, a microfabricated matrix replacement port for automated
matrix replacement in all capillaries will be incorporated. This device will replace the
matrix from the center of each capillary in the array using a liquid junction.

Software to automate data analysis for the fraction-collecting instrument will be
developed. The program will aid the operator in using the fluorescence traces from the
two detection windows on the capillary array to locate rapidly the collected fractions
containing particular peaks. The program will also generate reports with the trace
profiles. In the first year of the proposed work, this software will be supplemented by a
fully automated procedure to align the traces from the two detection windows using
either cross-correlation or dynamic time-warping (Malmquist & Danielsson, 1994;
Nielsen et al., 1999) followed by peak detection (Dyson, 1990).
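
As an illustration of the cross-correlation option mentioned above, the following Python sketch finds the sample shift that best aligns the trace from the second detection window with the trace from the first. It is a simplified, unnormalized cross-correlation over a bounded range of lags; the function name and the toy traces are assumptions, and a production procedure would likely subtract the baseline and may need dynamic time-warping when the shift is not constant along the trace.

```python
def best_lag(reference, trace, max_lag):
    """Return the lag (in samples) at which trace[i + lag] best matches reference[i]."""

    def score(lag):
        total = 0.0
        for i, ref_value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(trace):
                total += ref_value * trace[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)


# The same peak seen at the first window and, three samples later, at the second.
window_1 = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
window_2 = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]
lag = best_lag(window_1, window_2, max_lag=5)
```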

Sequencing of isolated mutants.

After the final CDCE step, the mutants from the multiwell agarose gel plate will
be automatically pipetted to a robotic workstation with an embedded thermocycler for
cycle sequencing. Since only short stretches of DNA (<1000 bp) will be sequenced,
desalting will not be used as a clean-up step. The samples will simply be diluted
ten-fold with water and electrokinetically injected. We will construct an automated
capillary array sequencer with column lengths of ~20 cm each, since only 100-200
bases will need to be sequenced. The instrument will be similar to one already built for
de novo long read sequencing using a 48 capillary array. It will incorporate a line
generator with a Powell lens to make uniform illumination of all capillaries. Detection
will use a CCD camera. This instrument can be easily expanded to a greater number of
capillaries, as required. The instrument will be dedicated for the many sequences to be
determined in this project.

The sequencing strategy will employ energy transfer dye terminator chemistry.
Each isolated mutant will be amplified and both strands sequenced. For one strand, a
primer will be used to anneal to a section of the high melting domain (clamp) sequence.
For the other strand, the primer used in hifi PCR will be employed. Along with
sequencing of mutant strands, appropriate controls will be employed including periodic
sequencing of standard DNA sequences. Base calling will be performed using the ABC
expert system developed for human genome sequencing (Miller and Karger, 1999).
Enrichment, isolation and sequencing of mutants will be performed at Northeastern
University.

Software and informatics.

Custom software will be needed for automating data acquisition, instrument
control and electrophoretic data analysis. Additional software will be needed to
coordinate laboratory activities, to centralize information, and to make the results of the
project accessible. It will be essential to track and coordinate activities for the thousands
of PCRs, CDCE electropherograms, pooled fractions, and DNA sequencing samples
generated over the course of the work. Initially, inexpensive commercial databases and
internet development tools will be used for these functions, though ultimately a
commercial laboratory information management system (LIMS) may be required if one
were to examine a much larger number of genes. Development and maintenance of the
data systems will be essential throughout the whole of the project as the
amount of data multiplies and the variety of analyses increases.

The mutation information emerging from the proposed work will constitute a
valuable database containing both sequence information and extensive annotations,
following the model of other genetic databases (Brown, 1999). Existing public
repositories (Burks, 1999) can also be leveraged as their number and contents burgeon
over the next few years. These resources will include a very substantial volume of
information on point mutations obtained by the described methods, which will be used
to cross-validate our results. Even outside of mutation databases, sequence databases can
be used to detect mutations for this purpose by finding mismatches in overlapping
sequences derived from different genetic lineages (Taillon-Miller et al., 1998;
Picoult-Newberg et al., 1999). An additional use of the databases will be routine
surveillance to identify additional genes involved in DNA replication, repair and
recombination, and to update information on known genes. This database work will be
accomplished primarily with sequence analysis and data mining software already in the
public domain (Altshul et al., 1997; Pruitt, 1998; Kuehl et al., 1999).

Samples 

Collect blood samples from a large number of juvenile Americans. Our first large
sample will use all intact samples from a large New York City children's lead testing
laboratory plus a set already collected in Boston. From this and a series of other lead
testing laboratories we expect to get a sufficiently large number of samples from major
ethnic groups to allow creation of separate pooled samples. We expect that
these different groups will carry a number of idiosyncratic point mutations which will
increase the number of obligatory knockout mutations observed and permit more
conclusive determination as to whether a gene carries recessive or dominant deleterious
alleles.

The comparison of different ethnic groups is expected to increase the number of
different inherited alleles observed for each gene. For instance, we expect to observe
different sets of recessive deleterious alleles for genes which have come to
Hardy-Weinberg equilibria in Asian populations as opposed to African populations.
Some alleles may be identical but several in a set of six to ten are expected to vary.




Selection of genes and defining target sequences in initial studies.

We have been interested in the mutations created during the DNA repair
processes for a number of years and have a special interest in determining the kinds of
SNPs present in genes that are involved in human DNA repair. We propose to initially
examine a set of genes known to be involved in DNA repair and recombination. One
would imagine that such genes are essential. They might carry dominant deleterious
alleles, but, if not, they must carry recessive deleterious alleles. Findings of any evidence
of inherited non-deleterious variation in these alleles could be of value in understanding
variations among humans with regard to mutation rates. Together they offer a look at a
multigenic set of factors which could disrupt or alter primary physiologic processes,
DNA repair and recombination.

We have selected 13 genes that are essential for mismatch, base excision and
nucleotide excision repair as well as DNA recombination (Table 9) (Sancar (1995),
Nelson et al. (1996), Acharya et al. (1996), Plug et al. (1997), Iaccarino et al. (1998),
Jeggo (1998), Kolodner and Marsischky (1999), Arbel et al. (1999), Zheng et al.
(1999)). cDNAs have been determined for each of these, and the complete sequences for
all selected genes, permitting us to define the exons and splice sites, are expected during
2000 when much of the human genome sequencing efforts will be completed.

As an internal control for detection of non-deleterious and recessive deleterious alleles
we propose to study two housekeeping genes, alpha-1-antichymotrypsin (AACT) and
adenine phosphoribosyltransferase (APRT), respectively, both of which are not involved
in either DNA repair or recombination. AACT is a plasma protease inhibitor
synthesized in the liver and the disorders caused by mutations in this gene are inherited
in an autosomal dominant manner. Several polymorphisms of this gene have been
described and the carriers of such polymorphisms show increased susceptibility to
Alzheimer disease (Kamboh et al., 1995) or Parkinson disease (Yamamoto et al., 1997).
The APRT gene will be studied as a control gene believed to carry only non-deleterious
alleles. Patients with complete deficiency of APRT excrete gravel consisting of stones of
2,8-dihydroxyadenine (DHA) in urine, but do not have hyperuricemia or gout. Already
nine allelic variants of the APRT gene are known (see the APRT entry in the OMIM
database).

We will determine the melting maps (example shown in Fig. 29) and restriction
maps for all 15 genes listed in Table 9. This process defines the target sequences
including coding sequences and adjacent splice sites in terms of a set of approximately
100 bp "isomelting domains" bounded by suitable restriction sites. Approximately 98%
of the human genome is assayable by this approach as estimated by computer scanning
of 0.5 megabases by melting temperature calculations. Restriction enzymes, sequences
of biotinylated probes and PCR primers will then be determined and tested.

Table 9. List of genes to be studied. This list is only a subset of all genes that are
involved in DNA repair and recombination. A list with virtually all genes that are
presently known (or suspected) to be involved in DNA repair and recombination has
been assembled and contains more than 100 entries. A subset had to be selected as the
study of all of these genes is not feasible within the framework of this proposal. The
other genes, however, remain of interest for future work.



Gene     Name / definition                            Chromos.      Genbank     Complete   cDNA
symbol                                                location      access #    sequence   length (bp)

Gene Involved in DNA Repair and/or Recombination
APE      AP endonuclease                              14q12         D13370      yes        1021
ATM      Ataxia telangiectasia                        11q22.3       U33841      yes        9385
BRCA1    Breast cancer, type 1                        17q21         L78833      yes        5580
ERCC2    excision repair, complementation group 2     19q13.3       L47234      yes        2247
HEX1     exonuclease 1                                1q43          AF042282    yes        2411
MPG      N-Methylpurine DNA glycosylase               16p13.3       NM_002434   no         1108
MSH2     DNA mismatch repair protein                  2p21          U03911      no         3080
MSH6     G/T mismatch binding protein (GTBP)          2p21          U28946      no         4264
POLB     DNA polymerase beta                          8p12-p11      D29013      no         1259
RAD54L   helicase (RAD54 like)                        1p32          X97795      no         2607
UNG      uracil-DNA glycosylase                       12q23-q24.1   X89398      yes        973
XPA      XPAC protein, xeroderma pigmentosum,         9q22.32       D14533      no         1377
         complementation group A
XPC      XP-C repair complementing protein (p125)     3p25.1        D21089      no         3558

Control Gene not Involved in DNA Repair or Recombination
AACT     alpha-1-antichymotrypsin                     14q32.1       K01500      yes        1324
APRT     adenine phosphoribosyltransferase            16q24.3       M16446      yes        538

Creation of pooled samples.

We will receive heparinized blood samples of about 2 or more ml which have
been stored frozen since sampling for lead determination. We will thaw each sample
rapidly, pipette rigorously to mix each sample, place 0.25 ml samples in each of two
separate tubes and place the remainder up to 2 ml in -135°C freezer vials which will be
immediately returned to storage. Thus each individual blood sample will be maintained
separately for any important future use, testing linkage hypotheses, for instance.

Since we extract DNA from whole blood, the total white blood cell (WBC) count is the
relevant parameter to consider variations from sample to sample. The automated
hematology reference ranges (95% confidence limits) from the
MIT Medical Department Laboratory are 4.8-10.8 million WBC/ml with a mode
of about 8 million per ml for pediatric samples. Thus the average number of WBC/ml in
five separate samples will be close to 8 x 10^6. At our limit of detection of 5 identical
mutant alleles per 100,000 sampled alleles, it appears we may reasonably neglect
variation due to the average number of WBC per five donors in estimating allele
frequency. No conclusion in this work would be based on attempting to differentiate an
allele fraction of 5/100,000 from 10/100,000.

The two 0.25 ml whole blood aliquots each contain approximately 2,000,000
WBC or about four million alleles for autosomal genes. We will pool 1000 blood
samples simultaneously to create two independent duplicate pooled blood samples
containing 2000 alleles each. Blood samples belonging to the same ethnic group will be
pooled together. A total of 50 such pooled samples each in duplicate will thus be created
from the 50,000 blood samples. Each pooled sample will contain about 2 x 10^9 WBC.
We will extract genomic DNA from the duplicated 50 samples pooled from 1000 donors
each, using a modified genomic DNA isolation protocol without exposing DNA to either
phenol or anion-exchange resins (Khrapko et al., 1997). The blood cells will be
precipitated and washed by centrifugation and resuspension in TE buffer (50 mM
Tris-HCl, pH 8.0, 10 mM EDTA). The genomic DNA will be isolated from the blood
cells through digestion with proteinase K (1 mg/ml) and SDS (0.5%) at 50°C for 3
hours, RNaseA (0.1 mg/ml) at 37°C for 1 hour, and then centrifugation and ethanol
precipitation. This protocol usually provides a greater than 90% yield of genomic DNA
from 10^5 to 10^9 cultured cells or milligrams to grams of tissues (Khrapko et al., 1997;
Li-Sucholeiki et al., 1999). The UV absorption ratio of A260/A280 is typically 1.5 - 1.8.
The isolated genomic DNA can be readily digested by restriction endonucleases.

These genomic DNA samples will be held separately at -135°C to create subsamples as
required. This will allow us to have sets of samples from each ethnic group and sets of
unknown ethnic origin. Aliquots of the 50 genomic DNA samples each in duplicate will
be pooled to create the final Master pooled DNA sample containing 100,000 alleles.
Sample pooling and DNA isolation will be performed at MIT.
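
The pooling arithmetic in this section can be checked with a few lines of Python. This is only a bookkeeping sketch; the constant and function names are invented for illustration, and the WBC density is the approximate pediatric mode of 8 million per ml quoted above.

```python
WBC_PER_ML = 8_000_000      # approximate pediatric mode quoted in the text
ALIQUOT_ML = 0.25           # volume taken from each donor per duplicate
DONORS_PER_POOL = 1000
NUMBER_OF_POOLS = 50


def pooling_summary():
    """Cell and allele counts implied by the pooling scheme (autosomal genes)."""
    wbc_per_aliquot = int(WBC_PER_ML * ALIQUOT_ML)
    total_donors = DONORS_PER_POOL * NUMBER_OF_POOLS
    return {
        "wbc_per_aliquot": wbc_per_aliquot,                  # cells from one donor
        "allele_copies_per_aliquot": 2 * wbc_per_aliquot,    # two copies per cell
        "wbc_per_pool": wbc_per_aliquot * DONORS_PER_POOL,
        "distinct_alleles_per_pool": 2 * DONORS_PER_POOL,
        "total_donors": total_donors,
        "distinct_alleles_in_master_pool": 2 * total_donors,
    }


summary = pooling_summary()
# Limit of detection quoted in the text: 5 identical mutant alleles per 100,000.
detection_limit_fraction = 5 / summary["distinct_alleles_in_master_pool"]
```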

Defining unit processes for individual target sequences. 

In this initial study of 15 genes we anticipate that as many as 150 target
sequences each consisting of ~100 bp will be processed.

We will optimize the high-fidelity PCR and CDCE conditions for each target
sequence using the automated 24-capillary array CDCE instrument and a temperature
gradient thermocycler in a 96-well plate coupled with a robotic workstation for
pipetting. The defined conditions will be integrated into the automated unit processes for
high-throughput enrichment and isolation of SNPs and rare SNPs in the 15 genes.

PCR optimization will be done in three stages as previously described. In the first stage,
we will apply our standard hifi PCR conditions to all of the target sequences. The only
PCR parameter to be adjusted for each target sequence at this stage is the reannealing
temperature, which will be set at about 5°C lower than the melting temperature of
individual primer pairs. For target sequences which can be amplified with a high
efficiency (≥50% per cycle) and low levels of nonspecific amplification and byproducts
(≤1% of the desired products), we will immediately proceed to CDCE optimization.

The standard hifi PCR will be performed in 20 - 30 µl using native Pyrococcus
furiosus (Pfu) DNA polymerase with an associated 3'→5' exonuclease activity and the
temperature gradient thermocycler in a 96-well plate. About 50 ng of genomic DNA
isolated from TK6 cells will be mixed with the two artificial mutants for each target
sequence at a mutant fraction above 10%, which will be used as the template. The
standard PCR mixture will contain 20 mM Tris-HCl (pH 8.0), 2 mM MgCl2, 10 mM
KCl, 6 mM (NH4)2SO4, 0.1% Triton X-100, 0.1 mg/ml BSA, 0.1 mM dNTPs, 0.2 µM
each primer, 0.1 U/µl Pfu.

The target sequences which do not have acceptable PCR products in the first
stage will be subjected to the second stage of optimization. In this stage, different
reannealing temperatures and reannealing times will be tested to improve the efficiency
or specificity of PCR. Some sequences may have a high level of PCR byproducts which
consist of incomplete or exonucleolytically processed products missing one to several
nucleotides. The formation of these byproducts is usually sequence-dependent and
associated with the properties of the polymerase. Our past experiences have shown that
the majority of these byproducts can be significantly reduced by incubating the final
PCR products at 45°C for 15 min or with a small amount of fresh Pfu at 72°C for 10 min
(Khrapko et al., 1997; Li-Sucholeiki et al., 1999).

In our preliminary point mutation studies, we have successfully amplified all of
the 11 target sequences under the standard PCR conditions, after adjusting for the
temperature and time of reannealing or including a post-PCR incubation step. We thus
expect that the optimal PCR conditions will be determined for the majority of the 150
targets at this stage. For the remaining target sequences that do not work, we will
proceed to the third step of optimization. This step will involve changing the
components of the PCR mixtures, such as the concentrations of the magnesium, dNTPs
and/or the primers. Finally, if this fails, we will redesign the primers and repeat the
optimization process. Software for data evaluation will be developed in the Karger
laboratory during the first year to aid in rapid semi-automated evaluation of the PCR
product peaks on CDCE. This software will accelerate the PCR optimization process,
particularly for the second and third stages.

The CDCE conditions for each target sequence will be optimized using the
24-capillary array CDCE instrument as previously described. The optimal CDCE
conditions for collecting point mutation-containing heteroduplexes should allow all of
the heteroduplexes to coalesce into a single fraction well separated from the wild type
homoduplex, while the optimal conditions for isolating point mutation-containing
homoduplexes should provide the greatest resolution among the observed homoduplex
peaks. These optimal conditions will be determined by first running the test samples at
various separation temperatures and then by lowering the electric field strength and/or
using a linear polyacrylamide matrix with high salt concentrations (Khrapko et al.,
1996). While lowering the field strength will increase the separation time, this will not
be a concern since those sequences requiring a lower field strength can be processed
together using the 12-capillary array CDCE instrument connected to an automated
fraction collection system. Evaluation of the data to select the optimum temperature will
be automated by software to make the required measurements of resolution and peak
shape.

Isolation of individual target sequences from the pooled DNA sample. 

A number of cell equivalents of 1000 cells per donor sample in a pooled sample
will be required for point mutation analysis of each target sequence to remove any
significant variation due to cell number. Thus the number of total WBC in the pooled
sample containing 50,000 donors would be 5 x 10^7 cells, which is equivalent to 300 µg of
genomic DNA. Such a large amount of genomic DNA cannot be directly subjected to a
routine hifi PCR procedure. Furthermore, we will only pool 2 x 10^6 WBC (or 0.25 ml of
blood) from each donor at this point. Thus one could think of our pooled blood samples
containing material to study only 2000 DNA sequences or about 20 genes, which are
certainly not enough for the proposed set of genes plus future point mutation
identifications.
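
The sample-size figures above follow from simple arithmetic, sketched below in Python. The value of about 6 pg of DNA per diploid human cell is an assumption introduced here (it is the commonly quoted approximate figure and is consistent with 300 µg from 5 x 10^7 cells); the function names are hypothetical.

```python
PG_DNA_PER_DIPLOID_CELL = 6.0   # assumed approximate DNA content of one human cell


def pooled_dna_micrograms(donors, cells_per_donor):
    """Mass of genomic DNA (in micrograms) in a pool of white blood cells."""
    total_cells = donors * cells_per_donor
    return total_cells * PG_DNA_PER_DIPLOID_CELL / 1e6   # 1 µg = 10^6 pg


def sequences_without_enrichment(wbc_per_donor, cells_per_donor_per_sequence):
    """Target sequences that could be studied if each one consumed its own cells."""
    return wbc_per_donor // cells_per_donor_per_sequence


dna_ug = pooled_dna_micrograms(donors=50_000, cells_per_donor=1000)
n_sequences = sequences_without_enrichment(2_000_000, 1000)
```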

To overcome these problems, we will enrich individual target sequences from the
pooled genomic DNA sample using our recently developed technology based on
sequence-specific hybridization coupled with biotin-streptavidin capture systems
(Li-Sucholeiki et al., 1999). This enrichment step can not only significantly reduce the
DNA sample size, permitting the subsequent PCR, but also allow us to isolate multiple
target sequences from the same genomic DNA sample, thus greatly increasing the
number of sequences that can be studied.

In this approach, the pooled genomic DNA sample is first digested with suitable
restriction endonucleases to release the DNA fragment containing the target sequence.
The digested DNA is then hybridized simultaneously to excess biotin-labeled oligo
DNA probes complementary to the Watson and Crick strands of the target sequence
which is embedded in the restriction fragment. The hybrids are then captured by
streptavidin-coated microspheric beads. Alternatively, the hybridization can be done
with biotinylated probes pre-immobilized on the streptavidin-coated beads. In either
case, the hybrid-bound beads are separated from the bulk DNA solution by
centrifugation or by applying a magnetic field if the beads have paramagnetic properties.
The beads are washed under stringent conditions to remove nonspecific binding. The
washings are combined with the bulk DNA solution. The probe-bound target sequence is
eluted from the beads into deionized H2O by heating. The eluate can be directly
amplified by hifi PCR. We have applied this method to enriching a digested APC gene
fragment. A 10,000-fold enrichment and over 70% recovery was achieved for this target
sequence (Li-Sucholeiki et al., 1999).

To enrich multiple target sequences from the same genomic DNA sample,
different types of streptavidin-coated beads, each containing a separate pair of probes for
a different target sequence, will be used to hybridize simultaneously with the genomic
DNA in the same reaction. After hybridization the different types of beads will be
separated from the DNA solution and then from each other. After washing, individual
target sequences will be separately eluted from each type of beads. This procedure can
be repeated for the same bulk DNA solution to enrich another set of target sequences.
We have demonstrated the use of paramagnetic beads and non-magnetic beads together
to enrich four different target sequences from the same genomic DNA sample
(Li-Sucholeiki et al., unpublished data).

The strategy for restriction digestion of the pooled genomic DNA sample at this
point is to choose a "6-cutter" restriction endonuclease to reduce the high molecular
weight genomic DNA into a reproducible set of fragments averaging about 4000 bp in
length. While theoretically we could choose any 6-cutter, such a step may cut a desired
sequence. We need 140 bp intact to amplify any desired 100 bp sequence after restriction
digestion. By chance any 140 bp sequence will be cut 140/4096 = 3.4% of the time.
But Murphy's Law predicts that some very important sequence will eventually be found
to be destroyed no matter what 6-cutter is chosen. Thus we will use two separate
6-cutters on the two Master Aliquot samples and process them in parallel thereafter. The
chance that both endonucleases would cut a particular 140 bp sequence is only
0.001168. Based on our preliminary results, we can in principle isolate up to 1000
separate sequences from each Master Aliquot containing an average of 1000 cells from
each donor. Thus the average 2 x 10^6 WBC obtained from each donor would permit
study of up to 2 million separate 100-bp sequences or as many as 100,000 genes.
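
The restriction-site estimate above can be reproduced with the short Python sketch below. It uses the same simplifications as the text: a random sequence with equal base frequencies, so that a given 6-bp recognition site is expected once every 4^6 = 4096 bp, and independence between the two enzymes. The function name is hypothetical, and the ratio is strictly the expected number of sites in the window, which approximates the cut probability when it is small.

```python
def cut_probability(window_bp, site_bp=6):
    """Approximate chance that a window contains a given recognition site."""
    return window_bp / 4 ** site_bp


average_fragment_bp = 4 ** 6              # expected spacing of a 6-cutter site
p_one_enzyme = cut_probability(140)       # 140 bp must stay intact
p_both_enzymes = p_one_enzyme ** 2        # two independent 6-cutters
```

With two parallel digests, a given 140 bp target is expected to survive intact in at least one Master Aliquot in all but roughly 0.12% of cases.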

Isolation of individual target sequences will be performed at MIT.

Limited amplification, attachment of clamp and fluorescein label. 

The target sequence in the target-enriched sample will be amplified by
high-fidelity PCR using Pfu polymerase and primers flanking the target sequence. One
primer is simply 20 bp in length. The other primer consists of 60 base pairs
including a 20-bp target specific sequence and a 40-bp non-monotonous GC sequence.
This primer is labeled 5' with a fluorescein molecule. Thus the PCR product molecules
will contain the desired 100 bp target sequence in a double stranded molecule with a
contiguous high melting temperature domain or "clamp" which is necessary for
achieving separation on the basis of differences in melting temperatures in the desired
low melting domain. The fluorescein tag permits measurement of the number of
molecules in a CDCE peak using laser induced fluorescence detection.

Because PCR, even with our high fidelity conditions using Pfu DNA polymerase,
creates mutations in the product molecules, care must be taken that these PCR created
mutants do not interfere with the observation and enumeration of point
mutation-containing sequences. The PCR induced mutant fraction, PCRmf, is the
product of the number of PCR doublings, d, the number of base pairs in the target
sequence, b, and the error rate in terms of mutations per base pair per doubling, f
(Keohavong and Thilly, 1988).

PCRmf = b x f x d

For Pfu DNA polymerase under our conditions f = 6 x 10^-7, and our target sequence is
100 bp. Thus if we amplified some 64x or 6 doublings (approximately 9 cycles), we
would expect that (100) (6 x 10^-7) (6) = 36 x 10^-5 of the products would contain PCR
induced mutations. These Pfu-induced mutations are generally distributed over 10
distinct hotspot mutations each with a mutant fraction of about 3.6 x 10^-5 after 6
doublings as proposed (Andre et al., 1997). Since even a single specific SNP in 100,000
alleles would have a mutant fraction of 5 x 10^-5 it would be observed as a clear peak
separate from the PCR "noise". A point mutation present in 0.1% of all persons would
be represented 10 times and produce a mutant fraction of 50 x 10^-5 "towering" over any
PCR noise.
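
The relation PCRmf = b x f x d can be evaluated directly, as in the Python sketch below. The exponents in the scanned source are partly illegible, so the error rate of 6 x 10^-7 mutations per base pair per doubling used here is an assumption chosen to be consistent with the stated product; the function name is hypothetical.

```python
def pcr_mutant_fraction(b, f, d):
    """PCR-induced mutant fraction: target length x error rate x doublings."""
    return b * f * d


DOUBLINGS = 6
amplification = 2 ** DOUBLINGS                                   # 64-fold
noise_total = pcr_mutant_fraction(b=100, f=6e-7, d=DOUBLINGS)    # all PCR mutants
noise_per_hotspot = noise_total / 10                             # ~10 hotspots assumed
```

Under these assumptions the total PCR-induced mutant fraction is 3.6 x 10^-4, spread over about ten hotspots of roughly 3.6 x 10^-5 each.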

For an accurate quantification of the allelic fractions of the SNPs and particularly
the rare point mutations, we will introduce the two artificial mutants into the
target-enriched sample prior to PCR at a known mutant fraction of 5 x 10^-5 to serve as
internal standards (Khrapko et al., 1997). The first hifi PCR step with 6 doublings will
create 6 x 10^9 copies (= 60 x 100,000 alleles x 1000 copies/allele) of fluorescently
labeled target sequences. The PCR sample will be boiled and reannealed to convert all
point mutation-containing mutant sequences into mutant/wild-type heteroduplexes. This
hifi PCR step will be performed at MIT.

Identification of point mutations and rare point mutations using automated unit
processes.

Some 10% of the PCR product sample (about 6 x 10^8 copies) will be separated on CDCE
at a suitable condition, and the heteroduplexes will be collected in a single fraction
separate from the wild-type homoduplex. We expect that this first CDCE collection will
enrich the mutants 20-fold against the wild type sequences. The heteroduplex fraction
will be amplified by hifi PCR. A second heteroduplex collection by CDCE and hifi PCR
will be performed to further enrich the mutants by about 5-fold. After the above
procedures, rare point mutations at an initial allelic fraction of 5 x 10^-5 will be present at
5 x 10^-3 in the PCR products, which can be visualized on CDCE. The initial allelic
fraction of a SNP can be determined by comparison to the internal standard (Khrapko et
al., 1997). Individual SNP-containing peaks will be isolated, amplified to create large
numbers of labeled molecules, checked for homogeneity (little peaks hidden under big
peaks), and finally, sequenced.

Enrichment and isolation of mutants will be performed with the 12-capillary
array CDCE instrument equipped with automated fraction collection for each column
which in turn interfaces with an automated collection transferring system for hifi PCR.
The temperature for each column will be set at the optimized condition pre-determined
for each target sequence. This system, which allows 12 different target sequences to be
processed simultaneously, is essential for our high throughput. Sequencing of the
isolated mutants will be done using the 8-capillary array CE instrument.



Tests for bias. " ■ 

We re-emphasize here that any and all rare point mutations identified in any
pooled samples will be re-assayed using separated Watson and Crick strands. This is
necessary in any assay for low frequency mutations since apparent mutations may arise from mismatch
intermediates in cells undergoing DNA synthesis at the time of sampling or as DNA
damaged sites which were converted into mutations during PCR. These kinds of errors
produce asymmetric distributions of apparent mutation on opposite DNA strands at the
same position. True mutations produce symmetrical and complementing mutations. We
introduced this control step in our studies of mitochondrial mutation and found two
examples attributable to mismatch intermediates in rapidly growing cells in culture
(Khrapko et al., 1997). To the best of our knowledge the use of this essential control step
is unique to our laboratory.
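
The strand-symmetry control described above can be illustrated with a short sketch. It assumes that an apparent mutant fraction has been measured separately on the Watson and Crick strands at the same position; the function name and the twofold agreement threshold are hypothetical choices made only for this example and are not values taken from our protocol.

```python
def is_strand_symmetric(watson_fraction, crick_fraction, max_ratio=2.0):
    """Return True when both strands show comparable mutant fractions.

    Assumption: a true mutation appears on both strands at similar
    levels, while an artifact (mismatch intermediate or damaged site)
    appears mainly on one strand. max_ratio is an illustrative cutoff.
    """
    # A signal missing from either strand is treated as asymmetric.
    if watson_fraction <= 0 or crick_fraction <= 0:
        return False
    high = max(watson_fraction, crick_fraction)
    low = min(watson_fraction, crick_fraction)
    return high / low <= max_ratio


if __name__ == "__main__":
    print(is_strand_symmetric(1.0e-4, 1.2e-4))  # comparable on both strands
    print(is_strand_symmetric(1.0e-4, 0.0))     # present on one strand only
```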

Discover which genes have obligatory knockout alleles at fractions that indicate
the importance of such genes in determining reproductive fitness. Identify and note the
fraction of genes scanned which appear to carry either dominant or recessive deleterious
alleles.

Our expectation is that if a gene carries only non-deleterious alleles, there will be
some obligatory knockout alleles in the set of inherited mutations with allelic fractions
greater than 1% with a sum of about 10%. If a gene carries recessive deleterious alleles
we expect to see the sum of obligatory knockout alleles to be around 0.7% or so. If a
gene carries dominant deleterious alleles we expect to see no obligatory knockout alleles
in any ethnic group with an allelic fraction as high as 5/100,000, our limit of detection.
The determination of a reproducible fine map as proposed will mean different things to
different scientists. It will give the first quantitative distribution of rare inherited alleles.
It will provide some indication of which genes in the DNA replication complex carry
dominant or recessive deleterious alleles. The observation of missense mutations in
these genes may suggest that they could be involved in mutator syndromes which
increase the spontaneous rate of somatic mutation. Such conditions would be expected
to hasten the appearance of cancer or atherosclerosis, i.e., they would be harmful alleles.
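
The expectations above can be expressed as a simple classification of a gene by the summed fraction of its obligatory knockout alleles. The sketch below is illustrative only: it borrows the approximate cutoffs recited in the claims (a sum below about 2% indicating a deleterious allele, and below about 0.02% indicating a dominant one), and the function name and return labels are hypothetical.

```python
def classify_gene(knockout_fractions,
                  recessive_cutoff=0.02,
                  dominant_cutoff=0.0002):
    """Classify a gene from the allelic fractions of its knockout alleles.

    Assumption: cutoffs of about 2% and about 0.02% (as fractions)
    separate non-deleterious, recessive and dominant cases.
    """
    total = sum(knockout_fractions)
    if total < dominant_cutoff:
        return "dominant deleterious"
    if total < recessive_cutoff:
        return "recessive deleterious"
    return "no deleterious allele indicated"


if __name__ == "__main__":
    print(classify_gene([0.04, 0.03, 0.03]))  # sum about 10%
    print(classify_gene([0.004, 0.003]))      # sum about 0.7%
    print(classify_gene([]))                  # nothing above detection limit
```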

We will create a readily accessible web file of our data as we obtain them as well
as submitting observations for publication in reviewed journals. Our data set will
include links to structural information as well as genomic sequences for each gene.

While this invention has been particularly shown and described with references
to preferred embodiments thereof, it will be understood by those skilled in the art that
various changes in form and details may be made therein without departing from the
scope of the invention encompassed by the appended claims.

CLAIMS

What is claimed is: 

1. A method for identifying inherited point mutations in a target region of a
genome, comprising providing a pool of DNA fragments isolated from a
population, and

a) amplifying said target region of each of said fragments in a high fidelity
polymerase chain reaction (PCR) under conditions suitable to produce
double stranded DNA products which contain a terminal high
temperature isomelting domain that is labeled with a detectable label, and
where the mutant fraction of each PCR-induced mutation is not greater
than about 5 x 10^-5;

b) melting and reannealing the product of a) under conditions suitable to
form duplexed DNA, thereby producing a mixture of wild type
homoduplexes and heteroduplexes which contain point mutations;

c) separating the heteroduplexes from the homoduplexes based upon the
differential melting temperatures of said heteroduplexes and said
homoduplexes and recovering the heteroduplexes, thereby producing a
second pool of DNA that is enriched in target regions containing point
mutations;

d) amplifying said second pool in a high fidelity PCR under conditions
where only homoduplexed double stranded DNA is produced, thereby
producing a mixture of homoduplexed DNA containing wild type target
region and homoduplexed DNAs which contain target regions that
include point mutations;

e) resolving the homoduplexed DNAs containing target regions which
include point mutations based upon the differential melting temperatures
of the DNAs, and recovering the resolved DNAs which contain a target
region which includes point mutations; and

f) sequencing the target region of the recovered DNAs to identify point
mutations within the target region.

2. The method of Claim 1 wherein said population comprises at least 1000
individuals.

3. The method of Claim 1 wherein said population comprises at least 10,000
individuals.

4. The method of Claim 1 wherein said population comprises between about 10,000
and about 1,000,000 individuals.

5. The method of Claim 1 wherein said population is a population of humans.

6. The method of Claim 5 wherein said human population consists of members of
the same demographic group.

7. The method of Claim 6 wherein said human population consists of individuals of
European ancestry or a subgroup thereof.

8. The method of Claim 6 wherein said human population consists of individuals of
African ancestry or a subgroup thereof.

9. The method of Claim 6 wherein said human population consists of individuals of
Asian ancestry or a subgroup thereof.

10. The method of Claim 6 wherein said human population consists of individuals of
Indian ancestry or a subgroup thereof.

11. The method of Claim 1 wherein said pool is enriched in fragments containing 
said target region. 

12. The method of Claim 1 wherein said high fidelity PCR is catalyzed by a high
fidelity polymerase, and wherein the copy number of each target region in a) is
doubled a maximum of about 6 times.

13. The method of Claim 1 wherein the heteroduplexes are separated from the
homoduplexes in c), and the homoduplexed DNAs are resolved in d) by constant
denaturing gel capillary electrophoresis, constant denaturing gel electrophoresis,
denaturing gradient gel electrophoresis or denaturing high performance liquid
chromatography.

14. The method of Claim 1 wherein the heteroduplexes are separated from the
homoduplexes in c), and the homoduplexed DNAs are resolved in d) by constant
denaturing gel capillary electrophoresis.

15. The method of Claim 1 wherein the target region is an isomelting domain
consisting of about 80 to about 3,000 base pairs (bp).

16. The method of Claim 1 wherein the target region is about 80 to about 1000 bp.

17. The method of Claim 1 wherein the target region is about 100 to about 500 bp. 



18. The method of Claim 1 wherein the target region is an exon or portion thereof of a
protein encoding gene.

19. The method of Claim 1 wherein the target region spans the junction of an intron
and an exon of a protein encoding gene.

20. The method of Claim 1 wherein the target region is a regulatory region of a
protein encoding gene.

21. The method of Claim 1 wherein the target region is an intron of a protein encoding
gene.

22. The method of Claim 1 wherein said target region is in a gene which encodes
RNA.

23. A method for identifying genes which carry a harmful allele, comprising:

a) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of young individuals, determining the 
frequencies with which each point mutation occurs, and calculating the 
sum of the frequency of all point mutations identified for each gene or 
segment; 

b) identifying the inherited point mutations which are found in the genes or 
portions thereof of a population of aged individuals, determining the 
frequencies with which each point mutation occurs, and calculating the 
sum of the frequencies of all point mutations identified for each gene or 
segment; 

c) comparing the sum frequency of point mutations which are found in a
selected gene or portion thereof of the young population calculated in a)
with the sum frequency of point mutations which are found in the same
gene or portion thereof of the aged population calculated in b), wherein a
significant decrease in the sum frequency of point mutations in the aged
population indicates that said selected gene carries a harmful allele.

24. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 23.

25. A method for identifying genes which carry a harmful allele, comprising:

a) identifying the inherited point mutations which are found in the genes or 
portions thereof of a population of young individuals, and determining 
the frequencies with which each point mutation occurs; 
b) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of aged individuals, and determining the
frequency with which each point mutation occurs; and

c) comparing the frequency of each point mutation identified in a selected
gene or portion thereof of the young population determined in a) with the
frequency of the same point mutations identified in said selected gene of
the aged population determined in b), wherein a significant decrease in
the frequency of two or more point mutations in said selected gene of the
aged population relative to said selected gene of the young population
indicates that said selected gene carries a harmful allele.

26. The method of Claim 25 further comprising:

d) determining the frequency of said two or more point mutations which
decrease in the aged population in said selected gene of one or more
intermediate age-specific populations;

e) determining the age-specific decline of said two or more point mutations;
and

f) comparing the age-specific decline determined in e) with the theoretical
age-specific decline of harmful alleles which cause mortal diseases,
X(h,t), and determining if the functions are significantly different,
wherein a determination that the age-specific decline determined in e) is
not significantly different from the theoretical age-specific decline of
harmful alleles which cause one or more mortal diseases further indicates
that said selected gene carries a harmful allele and has a high probability
of being causal of said one or more mortal diseases.

27. The method of Claim 26 further comprising:

g) determining the frequency of said two or more point mutations which
decrease in the aged population in said selected gene of one or more
proband populations; and

h) comparing the frequencies of said two or more point mutations in said
selected gene or portion thereof in the young population with the
frequencies of said two or more point mutations in said selected gene or
portion thereof in the proband populations; wherein a significant increase
in the frequencies of said one or more point mutations in the proband
population relative to the young population indicates that said gene carries
a harmful allele that plays a causal role in said disease.

28. The method of Claim 26 further comprising: 

g) determining the frequency of said two or more point mutations which 
decrease in the aged population in said selected gene of one or more 
proband populations consisting of individuals with early onset disease; 
and 

h) comparing the frequencies of said two or more point mutations in said
selected gene or portion thereof in the young population with the
frequencies of said two or more point mutations in said selected gene or
portion thereof in the proband populations; wherein a significant increase
in the frequencies of said one or more point mutations in the proband
population relative to the young population indicates that said gene
carries a harmful allele which is a secondary risk factor which accelerates
the appearance of disease.



29. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 25.

30. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 26.

31. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 27.

32. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 28.

33. A method for identifying genes which carry a harmful allele or which are linked
to a gene that carries a harmful allele, comprising:

a) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of young individuals, and determining
the frequency with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of aged individuals, and determining the
frequency with which each point mutation occurs;

c) comparing the frequency of each point mutation identified in a selected
gene or portion thereof of the young population determined in a) with the
frequency of the same point mutations identified in said selected gene of
the aged population determined in b), wherein a significant decrease in
the frequency of a point mutation in said selected gene of the aged
population relative to said selected gene of the young population
indicates that said selected gene carries a harmful allele or is linked to a
gene that carries a harmful allele.

34. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 33.

35. A method of identifying genes which carry a harmful allele that is a secondary
risk factor that accelerates the appearance of a disease, comprising:

a) identifying the inherited point mutations which are found in the genes or
portions thereof of an early onset proband population, and determining
the frequency with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or
portions thereof of a late onset proband population, and determining the
frequency with which each point mutation occurs;

c) comparing the frequencies of point mutations which are found in a
selected gene or portion thereof in the early onset proband population
with the frequencies of the same point mutations in said selected gene or
portion thereof of the late onset proband populations; wherein a
significant increase in the frequencies of one or more point mutations in
the early onset proband population relative to the late onset proband
population indicates that said gene carries a harmful allele which is a
secondary risk factor which accelerates the appearance of disease.

36. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 35.



37. A method of identifying genes which carry a harmful allele that is a secondary
risk factor that accelerates the appearance of a disease, comprising:

a) identifying the inherited point mutations which are found in the genes or
portions thereof of an early onset proband population, determining the
frequency with which each point mutation occurs, and calculating the
sum of the frequency of all point mutations identified for each gene or
segment;

b) identifying the inherited point mutations which are found in the genes or
portions thereof of a late onset proband population, and determining the
frequency with which each point mutation occurs, and calculating the
sum of the frequency of all point mutations identified for each gene or
segment;

c) comparing the sum frequency of point mutations which are found in a
selected gene or portion thereof of the early onset proband population
calculated in a) with the sum frequency of point mutations which are
found in the same gene or portion thereof of the late onset proband
population calculated in b), wherein a significant decrease in the sum
frequency of point mutations in the late onset proband population
indicates that said selected gene carries a harmful allele which is a
secondary risk factor that accelerates the appearance of a disease.

38. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 37.

39. A method for identifying genes which carry an allele which increases longevity,
comprising:

a) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of young individuals, determining the
frequencies with which each point mutation occurs, and calculating the
sum of the frequency of all point mutations identified for each gene or
segment;

b) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of aged individuals, determining the
frequencies with which each point mutation occurs, and calculating the
sum of the frequencies of all point mutations identified for each gene or
segment;

c) comparing the sum frequency of point mutations which are found in a
selected gene or portion thereof of the young population calculated in a)
with the sum frequency of point mutations which are found in the same
gene or portion thereof of the aged population calculated in b), wherein a
significant increase in the sum frequency of point mutations in the aged
population indicates that said selected gene carries an allele which
increases longevity.

40. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 39.

41. A method for identifying genes which carry an allele which increases longevity,
comprising: 

a) identifying the inherited point mutations which are found in the genes or 
portions thereof of a population of young individuals, and determining 
the frequencies with which each point mutation occurs; 

b) identifying the inherited point mutations which are found in the genes or 
portions thereof of a population of aged individuals, and determining the 
frequency with which each point mutation occurs; and 

c) comparing the frequency of each point mutation identified in a selected
gene or portion thereof of the young population determined in a) with the
frequency of the same point mutations identified in said selected gene of
the aged population determined in b), wherein a significant increase in the
frequency of two or more point mutations in said selected gene of the
aged population relative to said selected gene of the young population
indicates that said selected gene carries an allele which increases
longevity.

42. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 41.

43. A method for identifying genes which carry an allele which increases longevity
or which are linked to a gene that increases longevity, comprising:

a) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of young individuals, and determining
the frequency with which each point mutation occurs;

b) identifying the inherited point mutations which are found in the genes or
portions thereof of a population of aged individuals, and determining the
frequency with which each point mutation occurs;

c) comparing the frequency of each point mutation identified in a selected
gene or portion thereof of the young population determined in a) with the
frequency of the same point mutations identified in said selected gene of
the aged population determined in b), wherein a significant increase in the
frequency of a point mutation in said selected gene of the aged population
relative to said selected gene of the young population indicates that said
selected gene carries an allele which increases longevity or is linked to a
gene that increases longevity.

44. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 43.

45. A method for identifying genes which affect the incidence of a disease,
comprising:

a) identifying the inherited point mutations which are found in genes or
portions thereof of a population of young individuals not afflicted with
said disease, determining the frequencies with which each point mutation
occurs, and summing the frequency of all point mutations identified in
each gene or segment thereof;

b) identifying the inherited point mutations which are found in genes or
portions thereof of a proband population having said disease, determining
the frequencies with which each point mutation occurs, and summing the
frequency of all point mutations identified in each gene or segment
thereof;

c) comparing the sum frequency of point mutations in a selected gene or
portion thereof in the young population with the sum frequency of point
mutations in said selected gene or portion thereof in the proband
population; wherein a significant increase in the sum frequency of point
mutations in the proband population indicates that said gene plays a
causal role in said disease.

46. The method of Claim 29 wherein said disease is a mortal disease.

47. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 45.

48. A method for identifying a gene which carries deleterious alleles, comprising

a) identifying the inherited point mutations occurring in the exon(s) and
splice sites of said gene of a population of young individuals;

b) identifying the subset of point mutations in a) that are obligatory
knockout point mutations, and determining the frequencies with which
each obligatory knockout point mutation occurs; and

c) summing the frequency of all obligatory knockout point mutations
identified in the gene; wherein a sum frequency of less than about 2%
indicates that said gene carries a deleterious allele.

49. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 48.

50. A method for identifying a gene which carries deleterious alleles, comprising

a) identifying the inherited point mutations occurring in the exon(s) and
splice sites of said gene of a population of young individuals;

b) identifying the subset of point mutations in a) that are obligatory
knockout point mutations, and determining the frequencies with which
each obligatory knockout point mutation occurs;

c) identifying the subset of point mutations in a) that are presumptive
knockout point mutations, and determining the frequencies with which
each presumptive knockout point mutation occurs; and

d) summing the frequency of all of said obligatory knockout point mutations
and presumptive knockout point mutations identified in the gene; wherein
a sum frequency of less than about 2% indicates that said gene carries a
deleterious allele.

51. The method of Claim 32 wherein a sum of about 0.02% to about 2% indicates
that said gene carries a recessive deleterious allele.

52. The method of Claim 32 wherein a sum of less than about 0.02% indicates that
said gene carries a dominant deleterious allele.

53. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 50.

54. A method for isolating and identifying a target region of a genome which
contains inherited point mutations, comprising providing a pool of DNA 
fragments isolated from a population, and 

a) amplifying said target region of each of said fragments in a high fidelity
polymerase chain reaction (PCR) under conditions suitable to produce
double stranded DNA products which contain a terminal high
temperature isomelting domain that is labeled with a detectable label, and
where the mutant fraction of each PCR-induced mutation is not greater
than about 5 x 10^-5;

b) melting and reannealing the product of a) under conditions suitable to
form duplexed DNA, thereby producing a mixture of wild type
homoduplexes and heteroduplexes which contain point mutations;

c) separating the heteroduplexes from the homoduplexes based upon the
differential melting temperatures of said heteroduplexes and said
homoduplexes and recovering the heteroduplexes, thereby producing a
second pool of DNA that is enriched in target regions containing point
mutations;

d) amplifying said second pool in a high fidelity PCR under conditions
where only homoduplexed double stranded DNA is produced, thereby
producing a mixture of homoduplexed DNA containing wild type target
region and homoduplexed DNAs which contain target regions that
include point mutations;

e) resolving the homoduplexed DNAs containing target regions which
include point mutations based upon the differential melting temperatures
of the DNAs, and recovering the resolved DNAs which contain a target
region which includes point mutations; and

f) sequencing the target region of a recovered DNA which contains a target
region which includes point mutations.



55. An isolated target region of a genome which contains an inherited point
mutation isolated by the method of Claim 54.

56. An isolated nucleic acid which is complementary to a strand of a gene or allele
thereof identified by the method of Claim 54.

57. An array of isolated nucleic acids, immobilized on a solid support, said array
having at least about 100 different isolated nucleic acids which occupy separate
known sites in said array, wherein each of said different isolated nucleic acids
hybridizes to a target region which contains an inherited point mutation of Claim
54.

58. An array of isolated nucleic acids, immobilized on a solid support, said array
having at least about 100 different isolated nucleic acids which occupy separate
known sites in said array, wherein said array comprises all known deleterious,
harmful and beneficial point mutations for all human populations.

59. A method of identifying the inherited point mutations in any target region of a
genome of a population, wherein said point mutations

a) interfere with reproduction;

b) cause or accelerate the appearance of a mortal disease; or

c) prevent or delay the appearance of a mortal disease; wherein

the set of all inherited point mutations occurring at a frequency at or above 5
x 10^-5 is first identified separately in members of the same population who
comprise subpopulations selected from the group consisting of young, aged,
intermediate age, afflicted with disease, afflicted with a disease of early age onset
and afflicted with a disease of late age onset, by noting the frequencies of each
inherited point mutation within and between the subpopulations, thereby
determining which inherited point mutations are deleterious, harmful or beneficial.